Hi all,

I've been struggling with a bug which seems to be linked to several issues
in the polling system on Windows hosts.

When connecting gdb to a qemu-system (it happens with all the emulations
I've tried), I've noticed that commands are sometimes delayed. It happens
with every command but is most noticeable with "call" commands, which can
take more than 20s to complete.

While investigating, I found that the polling system seems to miss some events
and thus waits for the g_poll timeout (1s) before handling them. It can be seen
with any program launched with the gdbstub_io_command trace enabled:
$ gdb-system-arm -s -S
...
gdbstub_io_command Received: m422650,8
gdbstub_io_command Received: m422650,8
     Freeze for less than one second
gdbstub_io_command Received: P1f=d09fca0000000000
gdbstub_io_command Received: m422650,8
....

This is random but pretty obvious when the freeze happens.

An important note is that it's triggered by newer versions of glib. We have
a qemu-6 built with glib-2.54 where everything is fine, but when rebuilding
it with glib-2.60 this problem appears. I haven't checked glib 2.56 or 2.58
yet because they still use the autoconf approach instead of meson.
Anyway, I didn't find any obvious glib commit which could have introduced this
issue. If anyone more experienced with glib has an idea, I'm interested.

Afterwards, I've dug into qemu core and how it sets up the connection between
gdb and qemu. And I have several questions / ideas about what is happening.

IIUC, the gdb connection is handled using an io/channel-watch. This adds a
GSource for our given socket (-s being a tcp connection) to be polled by the
main loop.
For Windows, qio_channel_socket_source_check is the function used for the
check operation. In this function, we call both WSAEnumNetworkEvents and
select. The first one seems to be here only to reset the events while the
second retrieves them. However, the two calls together are not atomic, so my
guess is that some events are lost between them. I've tried several
workarounds: moving WSAEnumNetworkEvents after select, replacing it with
WSAResetEvent, and using auto/manual reset in CreateEvent. None of them worked.

Afterwards, I've tried to replace select with just WSAEnumNetworkEvents, which
is supposed to be enough, but I've faced another issue.
We have two sources connected to the same socket. These two sources have
different conditions: G_IO_HUP vs G_IO_IN + G_IO_OUT + ... It's fine on Linux,
but on Windows it seems to be problematic, as I'm getting the Read event on the
GSource having just G_IO_HUP. It's kind of logical, as the Windows API only
knows about the HANDLE, which is the same in both cases. I've made a quick
attempt to create another HANDLE for the second GSource, but it didn't work.
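
To make the suspected failure mode concrete, here is a toy model in portable C.
It makes no Winsock or glib calls, and every toy_* name and WANT_* flag is made
up for illustration. It assumes that all watchers of the socket share one event
object, that a read notification signals that event only once until the data is
actually read, and that the check callback resets the event before querying
readiness for its own conditions only. It is a sketch of the hypothesis above,
not a confirmed diagnosis and not the actual QEMU code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, NOT Winsock: one socket whose watchers share one event object. */
typedef struct {
    bool data_pending;   /* unread bytes sit in the socket buffer */
    bool read_armed;     /* a new read notification may be recorded */
    bool event_signaled; /* state of the shared event object */
} ToySocket;

enum { WANT_IN = 1, WANT_HUP = 2 };

/* Peer sends data: assumed to signal the event once, until data is read. */
static void toy_data_arrives(ToySocket *s)
{
    s->data_pending = true;
    if (s->read_armed) {
        s->read_armed = false;
        s->event_signaled = true;
    }
}

/* Stand-in for the "enumerate and reset" step whose result is discarded. */
static void toy_reset_event(ToySocket *s)
{
    s->event_signaled = false;
}

/* Stand-in for the zero-timeout readiness query, limited to `want`. */
static int toy_query(const ToySocket *s, int want)
{
    return ((want & WANT_IN) && s->data_pending) ? WANT_IN : 0;
}

/* Stand-in for a source's check callback: reset first, then query. */
static int toy_source_check(ToySocket *s, int want)
{
    toy_reset_event(s);
    return toy_query(s, want);
}

/* Reading drains the buffer and re-arms the read notification. */
static void toy_read(ToySocket *s)
{
    s->data_pending = false;
    s->read_armed = true;
}

/*
 * One piece of data arrives, then a source interested in `want` runs its
 * check. Returns true if the data was handled, or if the next wait would
 * still be woken by the event; false if it would sit out the poll timeout.
 */
static bool toy_scenario(int want)
{
    ToySocket s = { .data_pending = false, .read_armed = true,
                    .event_signaled = false };

    toy_data_arrives(&s);
    assert(s.event_signaled);        /* the first wait wakes up normally */

    if (toy_source_check(&s, want) & WANT_IN) {
        toy_read(&s);
    }
    return !s.data_pending || s.event_signaled;
}
```

In this model, toy_scenario(WANT_IN) returns true while toy_scenario(WANT_HUP)
returns false: when the source that only wants G_IO_HUP runs its check first,
the shared event is reset, the data stays unread, and nothing is left to wake
the next wait before its timeout expires.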

The GSource with G_IO_HUP is created by:
#0  qio_channel_create_socket_watch (... condition=G_IO_HUP) at io/channel-watch.c
#1  qio_channel_create_watch at io/channel.c
#2  update_ioc_handlers at chardev/char-socket.c
#3  tcp_chr_connect at chardev/char-socket.c
#4  tcp_chr_new_client at chardev/char-socket.c
#5  qio_net_listener_channel_func at io/net-listener.c
#6  g_main_dispatch at glib/gmain.c
#7  g_main_context_dispatch at glib/gmain.c
#8  os_host_main_loop_wait at util/main-loop.c:480
...

The other is made during the poll_prepare and added as a child_source of
the first one.
#0  qio_channel_create_socket_watch (..., condition=(G_IO_IN | G_IO_OUT | G_IO_ERR | G_IO_HUP | G_IO_NVAL)) at io/channel-watch.c
#1  qio_channel_create_watch at io/channel.c
#2  io_watch_poll_prepare at chardev/char-io.c
#3  io_watch_poll_prepare at chardev/char-io.c
#4  g_main_context_prepare at glib/gmain.c
#5  os_host_main_loop_wait at util/main-loop.c
...

I'm not familiar enough with glib to know whether these child sources work
fine on Windows.

I'm currently trying to change the approach: instead of creating a new source,
I want to update the previous one. But that needs some significant
modifications. As I'm a bit short on time, I'm looking for a workaround and any
advice on that.
For now, the only workaround I've found is to reduce the timeout in g_poll to
catch the missed events earlier...
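
For reference, the timeout workaround amounts to something like the helper
below. The function name and the cap value are hypothetical, and this is not
the actual change I made in QEMU. It only relies on the g_poll convention that
the timeout is in milliseconds and that a negative value means "block forever".

```c
#include <assert.h>

/*
 * Hypothetical helper: cap a poll timeout (in ms) so that a missed event
 * is picked up after at most cap_ms instead of after the full timeout.
 * A negative timeout_ms means "block forever" (g_poll convention).
 */
static int cap_poll_timeout_ms(int timeout_ms, int cap_ms)
{
    assert(cap_ms >= 0);
    if (timeout_ms < 0 || timeout_ms > cap_ms) {
        return cap_ms;
    }
    return timeout_ms;
}
```

With a cap of 50 ms, a 1000 ms timeout becomes 50 ms and shorter timeouts are
left alone. It doesn't fix the missed events, it only bounds the delay each one
causes, at the cost of more wake-ups of the main loop.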

@Paolo, you were the one who implemented the io/channel-watch part in
a5897205677; do you have any ideas or suggestions?

I'll try to send an update with a reproducer, but I haven't had time to create
it yet.

Thanks in advance
Clément
