On 10/17/25 03:40, Akihiko Odaki wrote:
> On 2025/10/17 8:43, Akihiko Odaki wrote:
>> On 2025/10/17 4:33, Dmitry Osipenko wrote:
>>> On 10/16/25 09:34, Akihiko Odaki wrote:
>>>> -        /* Wait for one thread to report a quiescent state and try again.
>>>> +        /*
>>>> +         * Sleep for a while and try again.
>>>>           * Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
>>>>           * wait too much time.
>>>>           *
>>>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>>>           * rcu_registry_lock is released.
>>>>           */
>>>>          qemu_mutex_unlock(&rcu_registry_lock);
>>>> -        qemu_event_wait(&rcu_gp_event);
>>>> +
>>>> +        if (forced) {
>>>> +            qemu_event_wait(&rcu_gp_event);
>>>> +
>>>> +            /*
>>>> +             * We want to be notified of changes made to rcu_gp_ongoing
>>>> +             * while we walk the list.
>>>> +             */
>>>> +            qemu_event_reset(&rcu_gp_event);
>>>> +        } else {
>>>> +            g_usleep(10000);
>>>> +            sleeps++;
>>>
>>> Thanks a lot for this RCU improvement. It indeed removes the hard stalls
>>> with unmapping of virtio-gpu blobs.
>>>
>>> Am I understanding correctly that we will potentially hit this
>>> g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
>>> MemoryRegion patches from Alex [1] are still needed to avoid stalls
>>> entirely.
>>>
>>> [1] https://lore.kernel.org/qemu-devel/20251014111234.3190346-6-[email protected]/
>>
>> That is right, but "avoiding stalls entirely" also causes use-after-free.
>>
>> The problem with virtio-gpu on TCG is that TCG keeps using the old
>> memory map until force_rcu is triggered. So, without force_rcu, the
>> following pseudo-code on a guest will result in use-after-free:
>>
>> address = blob_map(resource_id);
>> blob_unmap(resource_id);
>>
>> for (i = 0; i < some_big_number; i++)
>> *(uint8_t *)address = 0;
>>
>> *(uint8_t *)address will dereference the blob until force_rcu is
>> triggered, so finalizing the MemoryRegion before force_rcu results in
>> use-after-free.
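>>
>> For context, force_rcu is the notifier the RCU thread fires on readers
>> that keep a grace period from completing for too long. TCG's handler
>> kicks the vCPU out of its execution loop, which ends its read-side
>> critical section so it picks up the new memory map. Roughly
>> (paraphrased from memory; see accel/tcg/tcg-accel-ops-mttcg.c for the
>> real code):
>>
>>     typedef struct MttcgForceRcuNotifier {
>>         Notifier notifier;
>>         CPUState *cpu;
>>     } MttcgForceRcuNotifier;
>>
>>     static void do_nothing(CPUState *cpu, run_on_cpu_data d)
>>     {
>>     }
>>
>>     static void mttcg_force_rcu(Notifier *notify, void *data)
>>     {
>>         CPUState *cpu =
>>             container_of(notify, MttcgForceRcuNotifier, notifier)->cpu;
>>
>>         /*
>>          * Queuing any work forces the vCPU out of cpu_exec(); leaving
>>          * the execution loop ends the RCU read-side critical section.
>>          * async_run_on_cpu() is used because this is called with
>>          * rcu_registry_lock held.
>>          */
>>         async_run_on_cpu(cpu, do_nothing, RUN_ON_CPU_NULL);
>>     }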
>>
>> The best option I have in mind to eliminate the delay entirely is to
>> call drain_call_rcu(), but I am not in favor of such a change (for
>> now). drain_call_rcu() eliminates the delay if the FlatView protected
>> by RCU is the only referrer of the MemoryRegion, but that is not
>> guaranteed.
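>>
>> For illustration, a minimal sketch of what such a change could look
>> like; the unmap path and names here are hypothetical, not the actual
>> virtio-gpu code:
>>
>>     static void blob_unmap(MemoryRegion *mr, MemoryRegion *parent)
>>     {
>>         /* Drop the blob from the guest-visible memory map. */
>>         memory_region_del_subregion(parent, mr);
>>
>>         /*
>>          * Wait until every RCU reader (e.g. a TCG vCPU still running
>>          * on the old FlatView) has left its read-side critical
>>          * section.  Note that this drops the BQL internally.
>>          */
>>         drain_call_rcu();
>>
>>         /*
>>          * This finalizes immediately only if the FlatView was the
>>          * last referrer; any other outstanding reference keeps the
>>          * MemoryRegion alive and the delay remains.
>>          */
>>         object_unparent(OBJECT(mr));
>>     }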
>>
>> Performance should not be a concern anyway in this situation. The
>> guest should not waste CPU time by polling in the first place if you
>> really care about performance; since it's a para-virtualized device
>> and not real hardware, CPU time may be shared between the guest and
>> the device, and thus polling on the guest has an inherent risk of
>> slowing down the device. For performance-sensitive workloads, the
>> guest should:
>>
>> - avoid polling and
>> - accumulate commands instead of waiting for each
>>
>> The delay will be less problematic if the guest does so, and I think
>> at least Linux does avoid polling.
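>>
>> To sketch the difference (entirely hypothetical guest-side
>> pseudo-code, not a real driver API):
>>
>>     /* Bad: submit one command at a time and spin for each reply. */
>>     for (i = 0; i < n; i++) {
>>         submit_cmd(&cmds[i]);
>>         while (!cmd_done(&cmds[i])) {
>>             /* burns CPU time the device may need */
>>         }
>>     }
>>
>>     /* Better: queue everything, then block (sleep, not spin) once. */
>>     for (i = 0; i < n; i++) {
>>         submit_cmd(&cmds[i]);
>>     }
>>     wait_for_completions(n);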
>>
>> That said, stalling the guest forever in this situation is "wrong" (!=
>> "bad performance"). I wrote this patch to guarantee forward progress,
>> which is mandatory for semantic correctness.
>>
>> Perhaps drain_call_rcu() may also make sense in other, performance-
>> sensitive scenarios, but it should be added only after benchmarking,
>> or we will end up with a premature optimization.
>
> I first thought just adding drain_call_rcu() would work, but apparently
> it is not that simple. Adding drain_call_rcu() has a few problems:
>
> - It drops the BQL, which should be avoided. Problems caused by
> run_on_cpu(), which drops the BQL, have been discussed on the list a
> few times, and drain_call_rcu() may suffer from them as well.
>
> - It is less effective if the RCU thread enters g_usleep() before
> drain_call_rcu() is called.
>
> - It slows down readers due to the nature of drain_call_rcu().
>
> So, if you know some workload that may suffer from the delay, it may be
> a good idea to try it with the patches from Alex first, and then think
> of a clean solution if they improve performance.
What's not clear to me is whether this use-after-free problem affects
only TCG or also KVM.

Can we unmap the blob immediately using Alex's method when using KVM?
It would be great if we could optimize unmapping for KVM. TCG
performance is a much lesser concern.
--
Best regards,
Dmitry