On Mon, Jun 13, 2022 at 7:53 PM Peter Xu <pet...@redhat.com> wrote: > > On Mon, Jun 13, 2022 at 05:58:44PM -0300, Leonardo Bras Soares Passos wrote: > > Hello Peter, > > > > On Wed, Jun 8, 2022 at 5:23 PM Peter Xu <pet...@redhat.com> wrote: > > [...] > > > > In a previous iteration of the patchset, it was made clear that it's > > > > desirable to detect when the kernel falls back to copying mechanism, > > > > so the user of 'QIOChannelSocket' can switch to copying and avoid the > > > > overhead. This was done by the return value of flush(), which is 1 if > > > > that occurs. > > > > > > Two questions.. > > > > > > 1) When that happens, will MSG_ERRQUEUE keeps working just like zerocopy > > > is functional? > > > > I am not sure about what exactly you meant by 'like zerocopy is > > funcional', but the > > idea is that reading from MSG_ERRQUEUE should return a msg for each sendmsg > > syscall with MSG_ZEROCOPY that previously happened. This does not depend on > > the outcome (like falling back to the copying mechanism). > > btw, most of those messages may be batched to reduce overhead. > > > > At some point, zero-copy may fail, and fall back to copying, so in > > those messages > > an error code SO_EE_CODE_ZEROCOPY_COPIED can be seen. Having only > > those messages in a flush will trigger the returning of 1 from the > > flush function. > > Ah I think I missed the "reset ret==0 when !SO_EE_CODE_ZEROCOPY_COPIED" > path.. Sorry. > > > > > > > > > If the answer is yes, I don't see how ret=1 will ever be > > > returned.. because we'll also go into the same loop in > > > qio_channel_socket_flush() anyway. > > > > > > We set ret to 1 at function entry and then for each message in the > > MSG_ERRQUEUE, > > we test if it has error code different than SO_EE_CODE_ZEROCOPY_COPIED. > > If it ever have a different error code, we set ret=0. > > > > So, in our previous example, if we have a net device not supporting > > the 'Scatter-Gather' > > feature (NETIF_F_SG), every error message will be > > SO_EE_CODE_ZEROCOPY_COPIED, and it will return 1. > > > > > > > > > > If the answer is no, then since we'll have non-zero zero_copy_queued, > > > will the loop in qio_channel_socket_flush() go into a dead one? How > > > could it return? > > > > No, because it will go through all packets sent with MSG_ZEROCOPY, > > including the > > ones that fell back to copying, so the counter should be fine. If any > > code disables > > zero-copy, it will both stop sending stuff wil MSG_ZEROCOPY and flushing, > > so it > > should be fine. > > > > > > > > 2) Even if we have the correct ret=1 returned when that happens, which > > > caller is detecting that ret==1 and warn the admin? > > > > > > > No caller is using that right now. > > It's supposed to be a QIOChannel interface feature, and any > > user/implementation > > could use that information to warn if zero-copy is not being used, fall > > back to > > copying directly (to avoid overhead of testing zero-copy) or even use > > it to cancel the > > sending if wanted. > > > > It was a suggestion of Daniel on top of [PATCH v5 1/6] IIRC. > > OK the detection makes sense, thanks for the details. > > Then now I'm wondering whether we should have warned the admin already if > zero-copy send is not fully enabled in live migration. Should we add a > error_report_once() somewhere for the ret==1 already? After all the user > specify zero_copy_send=true explicitly. Did I miss something again? >
You are correct, I think warning the user is the valid thing to have here. At the end of the first iteration, where the first flush happens, I think it's too late to fail the migration, since a huge lot of the data has already been sent. Best regards, Leo