Re: [PATCH v2 1/1] io: make zerocopy fallback accounting more accurate

Tejus GK Mon, 16 Mar 2026 09:27:18 -0700

On 12/03/26 12:13 am, Peter Xu wrote:
> On Wed, Mar 11, 2026 at 05:46:56PM +0000, Daniel P. Berrangé wrote:
>> On Wed, Mar 11, 2026 at 01:28:36PM -0400, Peter Xu wrote:
>>> On Wed, Mar 11, 2026 at 04:56:17PM +0000, Daniel P. Berrangé wrote:
>>>> On Wed, Mar 11, 2026 at 11:30:26AM -0400, Peter Xu wrote:
>>>>> On Wed, Mar 11, 2026 at 12:02:05PM +0000, Daniel P. Berrangé wrote:
>>>>>> On Mon, Mar 09, 2026 at 02:21:49PM -0400, Peter Xu wrote:
>>>>>>> On Mon, Mar 09, 2026 at 05:51:29PM +0000, Daniel P. Berrangé wrote:
>>>>>>>> On Mon, Mar 09, 2026 at 05:42:08PM +0000, Tejus GK wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 9 Mar 2026, at 10:47 PM, Daniel P. Berrangé <[email protected]> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Mon, Mar 09, 2026 at 12:59:44PM -0400, Peter Xu wrote:
>>>>>>>>>>> On Mon, Mar 09, 2026 at 04:48:37PM +0000, Daniel P. Berrangé wrote:
>>>>>>>>>>>>> @@ -881,8 +881,8 @@ static int qio_channel_socket_flush_internal(QIOChannel *ioc,
>>>>>>>>>>>>>         sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
>>>>>>>>>>>>> 
>>>>>>>>>>>>>         /* If any sendmsg() succeeded using zero copy, mark zerocopy success */
>>>>>>>>>>>>> -        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
>>>>>>>>>>>>> -            sioc->new_zero_copy_sent_success = true;
>>>>>>>>>>>>> +        if (serr->ee_code == SO_EE_CODE_ZEROCOPY_COPIED) {
>>>>>>>>>>>>> +            sioc->zero_copy_fallback++;
>>>>>>>>>>>> 
>>>>>>>>>>>> ...this is counting the number of MSG_ERRQUEUE items, which is not
>>>>>>>>>>>> the same as the number of IO requests. That's why we only used it
>>>>>>>>>>>> as a boolean marker originally, rather than making it a counter.
>>>>>>>>>>> 
>>>>>>>>>>> Would the logic still work and better than before?  Say, it's a 
>>>>>>>>>>> counter of
>>>>>>>>>>> "messages" rather than "IOs" then.
>>>>>>>>>> 
>>>>>>>>>> IIUC it is a counter of processing notifications which is not 
>>>>>>>>>> directly
>>>>>>>>>> correlated to any action by QEMU - neither bytes nor syscalls.
>>>>>>>>> 
>>>>>>>>> Please correct me if I'm wrong about this, isn’t each notification an 
>>>>>>>>> information
>>>>>>>>> about what happened to an individual IO?
>>>>>>>> 
>>>>>>>> If userspace hasn't read a queued notification yet, the kernel will
>>>>>>>> merge new notifications with the existing queued one.
>>>>>>>> 
>>>>>>>> The line above your change
>>>>>>>> 
>>>>>>>>   serr->ee_data - serr->ee_info + 1;
>>>>>>>> 
>>>>>>>> records how many notifications were merged, so we know how many
>>>>>>>> syscalls were processed.
>>>>>>>> 
>>>>>>>> If ee_code is  SO_EE_CODE_ZEROCOPY_COPIED though it means at least
>>>>>>>> one syscall resulted in a copy, but that doesn't imply that *all*
>>>>>>>> syscalls resulted in a copy.
>>>>>>>> 
>>>>>>>> AFAICT, it could be 1 out of a 1000 syscalls resulted in a copy,
>>>>>>>> or it could be 1000 out of 1000 resulted in a copy. We don't know.
>>>>>>>> 
>>>>>>>> IIUC the kernel's merging of notifications appears lossy wrt this
>>>>>>>> information. It could be partially mitigated by doing a flush for
>>>>>>>> notifications really really frequently but that feels like it would
>>>>>>>> have its own downsides
>>>>>>> 
>>>>>>> IMHO what this change does is removing the false negatives.
>>>>>>> 
>>>>>>> Before this patch, if QEMU reports fallback=0, it doesn't mean all the
>>>>>>> MSG_ZEROCOPY requests were all fulfilled by zerocopy.  It's because we
>>>>>>> justify it with one boolean over "a period of time" between two 
>>>>>>> flushes, we
>>>>>>> set the boolean to TRUE as long as there is _one_ successful report of
>>>>>>> MSG_ZEROCOPY.  So even if every flush reports TRUE it only means "there 
>>>>>>> is
>>>>>>> at least one MSG_ZEROCOPY request that didn't fallback".  It has no
>>>>>>> implication of whether a fallback happened.
>>>>>>> 
>>>>>>> Hence, before this v2 patch, there can be false negative reported by 
>>>>>>> QEMU,
>>>>>>> assuming there's no fallback (reflected in stats) but it actually 
>>>>>>> happened.
>>>>>>> 
>>>>>>> After this patch, if QEMU reports fallback=0, it guarantees that _all_
>>>>>>> MSG_ZEROCOPY requests are fulfilled with zerocopy.  It's because we 
>>>>>>> monitor
>>>>>>> all messages and accumulate any fallback cases.  Even if the messages 
>>>>>>> can
>>>>>>> be merged, when "fallback" shows anything non-zero would imply some
>>>>>>> fallback happened.  Here, the counter value doesn't really matter much
>>>>>>> IMHO, as long as it becomes non-zero.
>>>>>> 
>>>>>> AFAICT, the v1 of this patch was sufficient to address the original
>>>>>> bug and maintain the current intended semantics of the migration
>>>>>> counter. This v2 is mixing a bug fix with functional change in
>>>>>> behaviour and I don't think the latter is justified.
>>>>> 
>>>>> It's just that when it cannot report all fallback cases, I don't yet see
>>>>> how it would help much even if we fix the previous behavior with v1..
>>>>> 
>>>>> OTOH, the new behavior will be deemed to have no issue on the problem v1
>>>>> was fixing.
>>>>> 
>>>>> So IIUC v2's behavior is the one we want, and helps identify fallback
>>>>> happened.
>>>> 
>>>> I don't consider v2 acceptable as the value it's returning is a
>>>> meaningless counter that doesn't correlate to any quantity that
>>>> is used by QEMU, nor visible to users of QEMU.
>>> 
>>> It can be a boolean if we want showing "if any fallback happened", that'll
>>> at least make it accurate and avoid false negatives. But IMHO a counter is
>>> always better, e.g. when we dump it from time to time we know if any more
>>> fallback happened.
>>> 
>>> In that case, no matter how that counter is defined in granularity that'll
>>> help, as long as it get boosted when fallback happened.
>>> 
>>> I also don't expect this value to be consumed by a user, but only reported
>>> by a user and should only be consumed by a developer.
>> 
>> Ok, so the problem is that we've got a design inversion between what
>> the kernel is reporting and what the io channel is reporting.
>> 
>> With the kernel notifications we can determine
>> 
>>  * All syscalls successfully used zero copy
>>  * At least one syscall failed to use zero copy
>> 
>> whereas what the io channel flush is (claiming) to report is
>> 
>>  * 1 => all syscalls failed to use zero copy
>>  * 0 => at least one syscall successfully used zero copy
>> 
>> and you cannot infer the latter from the former, as we have missing
>> information due to merging of notifications.
>> 
>> So we need to invert the return values semantics of the flush method
>> to account for the missing information:
>> 
>>  * 1 => at least one syscall failed to use zero copy
>>  * 0 => all syscalls successfully used zero copy
> Yep, this should be one good way to nail this problem.  Maybe Tejus, as a
> real consumer of this counter, will have a preference on how it looks the
> best.
> Thanks,
Hi all! Thank you for the suggestions, and apologies for the delay on this.


>  * 1 => at least one syscall failed to use zero copy
>  * 0 => all syscalls successfully used zero copy

I think these return semantics are appropriate, and they avoid the false
positives we had earlier. I can spin up a v3 if everyone agrees on this?


Regards,
Tejus
