On 2025/10/22 12:30, Dmitry Osipenko wrote:
On 10/17/25 03:40, Akihiko Odaki wrote:
On 2025/10/17 8:43, Akihiko Odaki wrote:
On 2025/10/17 4:33, Dmitry Osipenko wrote:
On 10/16/25 09:34, Akihiko Odaki wrote:
-            /* Wait for one thread to report a quiescent state and try again.
+            /*
+             * Sleep for a while and try again.
              * Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
              * wait too much time.
              *
@@ -133,7 +150,20 @@ static void wait_for_readers(void)
              * rcu_registry_lock is released.
              */
             qemu_mutex_unlock(&rcu_registry_lock);
-            qemu_event_wait(&rcu_gp_event);
+
+            if (forced) {
+                qemu_event_wait(&rcu_gp_event);
+
+                /*
+                 * We want to be notified of changes made to rcu_gp_ongoing
+                 * while we walk the list.
+                 */
+                qemu_event_reset(&rcu_gp_event);
+            } else {
+                g_usleep(10000);
+                sleeps++;
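For readers skimming the hunk, its control flow can be sketched as follows. This is a hypothetical, simplified model with made-up names, not the actual QEMU code: the real forced path blocks on qemu_event_wait()/qemu_event_reset(), and the real unforced path sleeps in g_usleep(10000); here both are modeled as counters so the shape of the loop is visible.

```c
#include <stdbool.h>

/*
 * Sketch of the two wait paths: a forced grace period blocks on the
 * event readers signal (modeled as costing no sleeps), while an
 * unforced one polls with bounded 10ms sleeps.  checks_until_quiescent
 * stands in for how many times the registry walk still finds a busy
 * reader.  Returns how many sleeps the unforced path performed.
 */
int wait_for_readers_sketch(bool forced, int checks_until_quiescent)
{
    int sleeps = 0;

    while (checks_until_quiescent-- > 0) {
        if (forced) {
            /*
             * Real code: qemu_event_wait(&rcu_gp_event), then
             * qemu_event_reset(&rcu_gp_event) so changes made while
             * walking the list are still observed.
             */
        } else {
            /* Real code: g_usleep(10000); */
            sleeps++;
        }
    }
    return sleeps;
}
```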
Thanks a lot for this RCU improvement. It indeed removes the hard stalls
with unmapping of virtio-gpu blobs.
Am I understanding correctly that potentially we will be hitting this
g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
MemoryRegion patches from Alex [1] are still needed to avoid stalls
entirely.
[1]
https://lore.kernel.org/qemu-devel/[email protected]/
That is right, but "avoiding stalls entirely" also causes use-after-free.
The problem with virtio-gpu on TCG is that TCG keeps using the old
memory map until force_rcu is triggered. So, without force_rcu, the
following pseudo-code on a guest will result in use-after-free:

    address = blob_map(resource_id);   /* map the host blob */
    blob_unmap(resource_id);           /* ask the host to unmap it */
    for (i = 0; i < some_big_number; i++)
        *(uint8_t *)address = 0;       /* stale mapping still writable */
*(uint8_t *)address will keep dereferencing the blob until force_rcu is
triggered, so finalizing the MemoryRegion before force_rcu results in
use-after-free.
The best option I have in mind to eliminate the delay entirely is to
call drain_call_rcu(), but I am not in favor of such a change (for now).
drain_call_rcu() eliminates the delay if the FlatView protected by RCU
is the only referrer of the MemoryRegion, but that is not guaranteed.
Performance should not be a concern anyway in this situation. The
guest should not waste CPU time by polling in the first place if it
really cares about performance; since this is a para-virtualized device
and not real hardware, CPU time may be shared between the guest and the
device, so polling in the guest carries an inherent risk of slowing
the device down. For performance-sensitive workloads, the guest should:
- avoid polling and
- accumulate commands instead of waiting for each
The delay will be less problematic if the guest does so, and I think
at least Linux does avoid polling.
That said, stalling the guest forever in this situation is "wrong" (!=
"bad performance"). I wrote this patch to guarantee forward progress,
which is mandatory for semantic correctness.
Perhaps drain_call_rcu() may also make sense in other, performance-
sensitive scenarios, but it should be added only after benchmarking, or
we will end up with a premature optimization.
I first thought just adding drain_call_rcu() would work but apparently
it is not that simple. Adding drain_call_rcu() has a few problems:
- It drops the BQL, which should be avoided. Problems caused by
run_on_cpu(), which drops the BQL, have been discussed on the list a few
times, and drain_call_rcu() may suffer from them as well.
- It is less effective if the RCU thread enters g_usleep() before
drain_call_rcu() is called.
- It slows down readers due to the nature of drain_call_rcu().
So, if you know of a workload that may suffer from the delay, it may be
a good idea to try it with the patches from Alex first, and then think
of a clean solution if they improve performance.
What's not clear to me is whether this use-after-free problem affects
only TCG or KVM as well.

Can we unmap a blob immediately using Alex's method when using KVM? It
would be great if we could optimize unmapping for KVM. TCG performance
is a much lesser concern.
Unfortunately no; the use-after-free can occur whichever accelerator is
used. It was discussed thoroughly in the past. In short, the patch
breaks reference counting (which also breaks RCU, as RCU is indirectly
tracked with reference counting), so the use-after-free can be triggered
with a DMA operation.
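The reference-counting breakage can be modeled with a toy (the names are mine, not QEMU's): if the only reference, the one held on behalf of the RCU-protected FlatView, is dropped early, an in-flight DMA reader can still reach the object after it has been freed.

```c
#include <stdbool.h>

/* Toy stand-in for a refcounted MemoryRegion backing a blob. */
typedef struct {
    int refcount;
    bool freed;    /* set when the last reference is dropped */
} Blob;

void blob_unref(Blob *b)
{
    if (--b->refcount == 0) {
        b->freed = true;   /* real code would finalize/free here */
    }
}

/* A DMA reader is only safe if the object has not been freed. */
bool dma_read_is_safe(const Blob *b)
{
    return !b->freed;
}

/*
 * Unmapping "immediately" drops the last reference while a DMA read is
 * still in flight; the subsequent access is a use-after-free.
 */
bool early_unmap_then_dma_read(void)
{
    Blob b = { 1, false };
    blob_unref(&b);              /* unmap drops the last ref early */
    return dma_read_is_safe(&b); /* in-flight DMA still touches b */
}
```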
The best summary I can find is the following:
https://lore.kernel.org/qemu-devel/[email protected]/
The breakage of reference counting can be checked with the code in the
following email:
https://lore.kernel.org/qemu-devel/[email protected]/
I suggested adding an RCU API to request a forced quiescent state for
virtio-gpu in another thread, as a solution that works on both TCG and
KVM; I think it is also the easiest solution for KVM that avoids a
regression. The thread can be found at:
https://lore.kernel.org/qemu-devel/[email protected]/
I think it is straightforward enough, but I'm worried that it may not be
implemented before the feature freeze. I am now just a hobbyist who has
little time to spend on QEMU, and RCU and memory region management are
too important to change in haste.
Regards,
Akihiko Odaki