On 10/17/25 03:40, Akihiko Odaki wrote:
> On 2025/10/17 8:43, Akihiko Odaki wrote:
>> On 2025/10/17 4:33, Dmitry Osipenko wrote:
>>> On 10/16/25 09:34, Akihiko Odaki wrote:
>>>> -        /* Wait for one thread to report a quiescent state and try again.
>>>> +        /*
>>>> +         * Sleep for a while and try again.
>>>>            * Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
>>>>            * wait too much time.
>>>>            *
>>>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>>>            * rcu_registry_lock is released.
>>>>            */
>>>>           qemu_mutex_unlock(&rcu_registry_lock);
>>>> -        qemu_event_wait(&rcu_gp_event);
>>>> +
>>>> +        if (forced) {
>>>> +            qemu_event_wait(&rcu_gp_event);
>>>> +
>>>> +            /*
>>>> +             * We want to be notified of changes made to rcu_gp_ongoing
>>>> +             * while we walk the list.
>>>> +             */
>>>> +            qemu_event_reset(&rcu_gp_event);
>>>> +        } else {
>>>> +            g_usleep(10000);
>>>> +            sleeps++;
>>>
>>> Thanks a lot for this RCU improvement. It indeed removes the hard stalls
>>> with unmapping of virtio-gpu blobs.
>>>
>>> Am I understanding correctly that potentially we will be hitting this
>>> g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
>>> MemoryRegion patches from Alex [1] are still needed to avoid stalls
>>> entirely.
>>>
>>> [1]
>>> https://lore.kernel.org/qemu-devel/[email protected]/
>>
>> That is right, but "avoiding stalls entirely" also causes use-after-free.
>>
>> The problem with virtio-gpu on TCG is that TCG keeps using the old
>> memory map until force_rcu is triggered. So, without force_rcu, the
>> following pseudo-code on a guest will result in use-after-free:
>>
>> address = blob_map(resource_id);
>> blob_unmap(resource_id);
>>
>> for (i = 0; i < some_big_number; i++)
>>    *(uint8_t *)address = 0;
>>
>> *(uint8_t *)address keeps dereferencing the blob until force_rcu is
>> triggered, so finalizing the MemoryRegion before force_rcu results in
>> use-after-free.
>>
>> The best option I have in mind to eliminate the delay entirely is to
>> call drain_call_rcu(), but I'm not in favor of such a change (for now).
>> drain_call_rcu() eliminates the delay if the FlatView protected by RCU
>> is the only referrer of the MemoryRegion, but that is not guaranteed.
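
For illustration, I read that option as roughly the sketch below. The
function name, its parameters, and its placement in the unmap path are
my own guesses for this example, not the actual virtio-gpu code; only
memory_region_del_subregion(), drain_call_rcu() and object_unparent()
are existing QEMU APIs:

    /* Hypothetical unmap path, called with the BQL held. */
    static void unmap_blob(MemoryRegion *hostmem, MemoryRegion *mr)
    {
        memory_region_del_subregion(hostmem, mr);
        /*
         * Wait until previously queued call_rcu() callbacks have run,
         * so the RCU-protected FlatView no longer references the
         * region before it is finalized.  drain_call_rcu() temporarily
         * drops the BQL while it waits.
         */
        drain_call_rcu();
        object_unparent(OBJECT(mr));
    }

As you say, this only removes the delay when the FlatView really is the
last referrer.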
>>
>> Performance should not be a concern anyway in this situation. The
>> guest should not waste CPU time by polling in the first place if you
>> really care about performance; since it's a para-virtualized device and
>> not real hardware, CPU time may be shared between the guest and the
>> device, and thus polling on the guest has an inherent risk of slowing
>> down the device. For performance-sensitive workloads, the guest should:
>>
>> - avoid polling and
>> - accumulate commands instead of waiting for each
>>
>> The delay will be less problematic if the guest does so, and I think
>> at least Linux does avoid polling.
>>
>> That said, stalling the guest forever in this situation is "wrong" (!=
>> "bad performance"). I wrote this patch to guarantee forward progress,
>> which is mandatory for semantic correctness.
>>
>> Perhaps drain_call_rcu() may also make sense in other, performance-
>> sensitive scenarios, but it should be added only after benchmarking, or
>> we will end up with a premature optimization.
> 
> I first thought just adding drain_call_rcu() would work but apparently
> it is not that simple. Adding drain_call_rcu() has a few problems:
> 
> - It drops the BQL, which should be avoided. Problems caused by
> run_on_cpu(), which also drops the BQL, have been discussed on the list
> a few times, and drain_call_rcu() may suffer from them as well.
> 
> - It is less effective if the RCU thread enters g_usleep() before
> drain_call_rcu() is called.
> 
> - It slows down readers due to the nature of drain_call_rcu().
> 
> So, if you know of a workload that may suffer from the delay, it may be
> a good idea to try it with the patches from Alex first, and then think
> of a clean solution if they improve performance.

Thanks a lot for the clarification. I'm seeing occasional 10ms stalls
with this patch applied, but it's still a huge improvement. Looking
forward to v2.

In addition to the guest waiting for virgl command completion, QEMU
display updates on the host are also blocked while the unmapping command
is suspended. This is a noticeable problem for interactive GFX
applications running on the guest.

-- 
Best regards,
Dmitry
