On Thu, 28 Mar 2024 07:40:14 -0500, Eric Blake wrote: > On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang via wrote: > > If g_main_loop_run()/aio_poll() is called in the coroutine context, > > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup > > may be disordered. > > > > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means > > some listened events is completed. Therefore, the completion callback > > function is dispatched. > > > > If this callback function needs to invoke aio_co_enter(), it will only > > wake up the coroutine (because we are already in coroutine context), > > which may cause that the data on this listening event_fd/socket_fd > > is not read/cleared. When the next poll () exits, it will be woken up again > > and inserted into the wakeup queue again. > > > > For example, if TLS is enabled in NBD, the server will call > > g_main_loop_run() > > in the coroutine, and repeatedly wake up the io_read event on a socket. > > The call stack is as follows: > > > > aio_co_enter() > > aio_co_wake() > > qio_channel_restart_read() > > aio_dispatch_handler() > > aio_dispatch_handlers() > > aio_dispatch() > > aio_ctx_dispatch() > > g_main_context_dispatch() > > g_main_loop_run() > > nbd_negotiate_handle_starttls() > > nbd_negotiate_options() > > nbd_negotiate() > > nbd_co_client_start() > > coroutine_trampoline() > > zhuyangyang, do you have a reliable reproduction setup for how you > were able to trigger this? Obviously, it only happens when TLS is > enabled (we aren't creating a g_main_loop_run for any other NBD > command), and only when the server is first starting to serve a > client; is this a case where you were hammering a long-running qemu > process running an NBD server with multiple clients trying to > reconnect to the server all near the same time?
This problem cannot be reproduced after 7c1f51bf38 ("nbd/server: Fix drained_poll to wake coroutine in right AioContext") that avoids repeatedly waking up the same coroutine. Invoking g_main_loop_run() in the coroutine will cause that event completion callback function qio_channel_restart_read() is called repeatedly, but the coroutine is woken up only once. The key modifications are as follows: static void qio_channel_restart_read(void *opaque) { QIOChannel *ioc = opaque; - Coroutine *co = ioc->read_coroutine; + Coroutine *co = qatomic_xchg(&ioc->read_coroutine, NULL); + + if (!co) { + return; + } /* Assert that aio_co_wake() reenters the coroutine directly */ assert(qemu_get_current_aio_context() == qemu_coroutine_get_aio_context(co)); aio_co_wake(co); } The root cause is that poll() is invoked in coroutine context. > > If we can come up with a reliable formula for reproducing the > corrupted coroutine list, it would make a great iotest addition > alongside the existing qemu-iotests 233 for ensuring that NBD TLS > traffic is handled correctly in both server and client. > > > > > Signed-off-by: zhuyangyang <zhuyangyan...@huawei.com> > > Side note: this appears to be your first qemu contribution (based on > 'git shortlog --author zhuyangyang'). While I am not in a position to > presume how you would like your name Anglicized, I will point out that > the prevailing style is to separate given name from family name (just > because your username at work has no spaces does not mean that your > S-o-b has to follow suit). It is also permissible to list your name > in native characters alongside or in place of the Anglicized version; > for example, 'git log --author="Stefano Dong"' shows this technique. -- Best Regards, Zhu Yangyang