> Daniel, is this in your area of expertise?
>
> Jie Song, can you identify the commit that introduced the bug?
>
> Jie Song <[email protected]> writes:
>
> > From: Jie Song <[email protected]>
> >
> > When starting a dummy QEMU process with virsh, monitor_init_qmp() enables
> > IOThread monitoring of the QMP fd by default. However, a race condition
> > exists during the initialization phase: the IOThread only removes the
> > main thread's fd watch when it reaches
> > qio_net_listener_set_client_func_full(),
> > which may be delayed under high system load.
> >
> > This creates a window between monitor_qmp_setup_handlers_bh() and
> > qio_net_listener_set_client_func_full() where both the main thread and
> > IOThread are simultaneously monitoring the same fd and processing events.
> > This race can cause either the main thread or the IOThread to hang and
> > become unresponsive.
> >
> > Fix this by proactively cleaning up the listener's IO sources in
> > monitor_init_qmp() before the IOThread initializes QMP monitoring,
> > ensuring exclusive fd ownership and eliminating the race condition.
> >
> > The fix introduces socket_chr_listener_cleanup() to destroy and unref
> > all existing IO sources on the socket chardev listener, guaranteeing
> > that no concurrent fd monitoring occurs during the transition to
> > IOThread handling.
> >
> > Signed-off-by: Jie Song <[email protected]>
> > ---
> > chardev/char-socket.c | 18 ++++++++++++++++++
> > include/chardev/char-socket.h | 2 ++
> > monitor/qmp.c | 6 ++++++
> > 3 files changed, 26 insertions(+)
> >
> > diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> > index 62852e3caf..073a9da855 100644
> > --- a/chardev/char-socket.c
> > +++ b/chardev/char-socket.c
> > @@ -656,6 +656,24 @@ static void tcp_chr_telnet_destroy(SocketChardev *s)
> > }
> > }
> >
> > +void socket_chr_listener_cleanup(Chardev *chr)
> > +{
> > +    SocketChardev *s = SOCKET_CHARDEV(chr);
> > +
> > +    if (s->listener) {
> > +        QIONetListener *listener = s->listener;
> > +        size_t i;
> > +
> > +        for (i = 0; i < listener->nsioc; i++) {
> > +            if (listener->io_source[i]) {
> > +                g_source_destroy(listener->io_source[i]);
> > +                g_source_unref(listener->io_source[i]);
> > +                listener->io_source[i] = NULL;
> > +            }
> > +        }
> > +    }
> > +}
> > +
> > static void tcp_chr_update_read_handler(Chardev *chr)
> > {
> > SocketChardev *s = SOCKET_CHARDEV(chr);
> > diff --git a/include/chardev/char-socket.h b/include/chardev/char-socket.h
> > index d6d13ad37f..682440c6de 100644
> > --- a/include/chardev/char-socket.h
> > +++ b/include/chardev/char-socket.h
> > @@ -84,4 +84,6 @@ typedef struct SocketChardev SocketChardev;
> > DECLARE_INSTANCE_CHECKER(SocketChardev, SOCKET_CHARDEV,
> > TYPE_CHARDEV_SOCKET)
> >
> > +void socket_chr_listener_cleanup(Chardev *chr);
> > +
> > #endif /* CHAR_SOCKET_H */
> > diff --git a/monitor/qmp.c b/monitor/qmp.c
> > index cb99a12d94..d9d1fafa70 100644
> > --- a/monitor/qmp.c
> > +++ b/monitor/qmp.c
> > @@ -25,6 +25,7 @@
> > #include "qemu/osdep.h"
> >
> > #include "chardev/char-io.h"
> > +#include "chardev/char-socket.h"
> > #include "monitor-internal.h"
> > #include "qapi/error.h"
> > #include "qapi/qapi-commands-control.h"
> > @@ -537,6 +538,11 @@ void monitor_init_qmp(Chardev *chr, bool pretty, Error **errp)
> >           * e.g. the chardev is in client mode, with wait=on.
> >           */
> >          remove_fd_in_watch(chr);
> > +        /*
> > +         * Clean up listener IO sources early to prevent racy fd
> > +         * handling between the main thread and the I/O thread.
> > +         */
> > +        socket_chr_listener_cleanup(chr);
> >          /*
> >           * We can't call qemu_chr_fe_set_handlers() directly here
> >           * since chardev might be running in the monitor I/O

Hi Markus,

Thank you for the question.

The issue is not tied to any specific commit; it arises from the current
initialization flow. For example, when virsh starts a dummy QEMU process,
a command line like the following may trigger the bug:

`/usr/bin/qemu-system-x86_64 -S -no-user-config -nodefaults -nographic -machine none,accel=tcg -qmp unix:/var/lib/libvirt/qemu/qmp-xxx/qmp.monitor,server=on,wait=off`

We can reproduce this issue using gdb with the following steps:

1. Pause the I/O thread: Execute monitor_init_qmp in the main thread and,
   before aio_bh_schedule_oneshot is called, suspend the I/O thread
   (scheduler-locking on). This simulates a high-load scenario.
2. Set a breakpoint at qemu_accept: Allow the main thread to continue
   running. At this point the main thread is still listening on the
   corresponding chardev (the QMP socket), so it can reach qemu_accept.
3. Simulate a client connection: Use nc -U to simulate a client connecting
   to the Unix socket. The main thread will detect the event and hit the
   breakpoint at qemu_accept.
4. Resume the I/O thread: Switch to the I/O thread and allow it to run.
   It will also reach the qemu_accept breakpoint, so both threads are now
   handling the same accept event.

This race causes either the main thread or the IOThread to hang and become
unresponsive.

The issue stems from the window between when the main thread sets up the
listener watch and when the IOThread takes over exclusive ownership. Under
normal conditions this window is very small, but under high load or with
specific timing, both threads can end up processing events on the same fd
simultaneously.
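
To make the window concrete, here is a minimal standalone sketch. It is not
QEMU code: double_watch_demo() and fd_is_readable() are hypothetical names,
and plain poll() stands in for the GSource watches. It shows that two
watchers on one listening fd both see a single pending connection, while
only one accept() can succeed. The demo puts the listener in non-blocking
mode so that it cannot hang. With a blocking listener, which I am assuming
is the situation in the hang described above, the second accept() would
simply block.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd has a pending event right now, else 0. */
static int fd_is_readable(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };

    return poll(&p, 1, 0) == 1 && (p.revents & POLLIN);
}

/*
 * Simulate two watchers (think "main thread" and "IOThread") on the same
 * listening Unix socket with exactly one client connecting.
 * Returns 0 on success, -1 on setup failure.
 */
int double_watch_demo(int *both_saw_event, int *second_accept_errno)
{
    struct sockaddr_un addr;
    int lsock = -1, client = -1, conn = -1, second = -1;
    int main_sees, iothread_sees;
    int rc = -1;

    assert(both_saw_event && second_accept_errno);

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    snprintf(addr.sun_path, sizeof(addr.sun_path),
             "/tmp/double-watch-%ld.sock", (long)getpid());
    unlink(addr.sun_path);

    lsock = socket(AF_UNIX, SOCK_STREAM, 0);
    client = socket(AF_UNIX, SOCK_STREAM, 0);
    if (lsock < 0 || client < 0 ||
        bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lsock, 1) < 0 ||
        /* Non-blocking only so that this demo cannot hang. */
        fcntl(lsock, F_SETFL, O_NONBLOCK) < 0 ||
        connect(client, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        goto out;
    }

    /* Both watchers are notified about the same pending connection. */
    main_sees = fd_is_readable(lsock);
    iothread_sees = fd_is_readable(lsock);
    *both_saw_event = main_sees && iothread_sees;

    /* The first watcher to call accept() consumes the connection ... */
    conn = accept(lsock, NULL, NULL);
    /* ... and the second finds nothing (it would block on a blocking fd). */
    second = accept(lsock, NULL, NULL);
    *second_accept_errno = (second < 0) ? errno : 0;

    rc = (conn >= 0) ? 0 : -1;
out:
    if (second >= 0) close(second);
    if (conn >= 0) close(conn);
    if (client >= 0) close(client);
    if (lsock >= 0) close(lsock);
    unlink(addr.sun_path);
    return rc;
}
```

The patch avoids this situation by making sure only one watcher exists: the
main thread's listener sources are destroyed before the IOThread installs
its own.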

I hope this explanation clarifies the issue.

Best regards,
Jie Song