Hi, I've been following the qemu-colo.rst steps to test COLO and
encountered a couple of issues. Unfortunately, I don't have the cycles to
investigate further. Both issues occur with QEMU master (also tested on
versions going back to the COLO fix 0b5bf4ea76).
1) Deadlock at fdmon_io_uring_wait:
(steps from qemu-colo.rst)
- Secondary Failover
- Secondary resume replication
- Start the new Secondary
- Sync
- Wait until the disk is synced, then (also scripted in the sketch below,
  after the backtrace):
  {"execute": "stop"}
  {"execute": "block-job-cancel", "arguments": {"device": "resync"}}
The above results in the old secondary hanging indefinitely at:
do {
    ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr);
} while (ret == -EINTR);
(gdb) bt
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f5519e0204e in ??? () at //usr/lib64/liburing.so.2
#2 0x00007f5519e01b00 in ??? () at //usr/lib64/liburing.so.2
#3 0x0000563c2dc06cc9 in fdmon_io_uring_wait (ctx=0x563c30411b00, ready_list=0x7ffd0bad8f58, timeout=575708467831) at ../util/fdmon-io_uring.c:416
#4 0x0000563c2dc00976 in aio_poll (ctx=0x563c30411b00, blocking=true) at ../util/aio-posix.c:699
#5 0x0000563c2daa01c6 in bdrv_drain_all_begin () at ../block/io.c:529
#6 0x0000563c2daa03d8 in bdrv_drain_all () at ../block/io.c:574
#7 0x0000563c2d764aae in do_vm_stop (state=RUN_STATE_PAUSED, send_stop=true) at ../system/cpus.c:312
#8 0x0000563c2d765964 in vm_stop (state=RUN_STATE_PAUSED) at ../system/cpus.c:754
#9 0x0000563c2d7f3378 in qmp_stop (errp=0x7ffd0bad9080) at ../monitor/qmp-cmds.c:62
#10 0x0000563c2dba7a72 in qmp_marshal_stop (args=0x563c306ac070, ret=0x7f5518dffda8, errp=0x7f5518dffda0) at qapi/qapi-commands-misc.c:197
#11 0x0000563c2dbf1316 in do_qmp_dispatch_bh (opaque=0x7f5518dffe40) at ../qapi/qmp-dispatch.c:128
#12 0x0000563c2dc1de48 in aio_bh_call (bh=0x563c3040fef0) at ../util/async.c:173
#13 0x0000563c2dc1df64 in aio_bh_poll (ctx=0x563c3040c070) at ../util/async.c:220
#14 0x0000563c2dbffff0 in aio_dispatch (ctx=0x563c3040c070) at ../util/aio-posix.c:389
#15 0x0000563c2dc1e3cd in aio_ctx_dispatch (source=0x563c3040c070, callback=0x0, user_data=0x0) at ../util/async.c:365
#16 0x00007f551b114f4c in g_main_dispatch (context=0x563c304120f0) at ../glib/gmain.c:3476
#17 g_main_context_dispatch_unlocked (context=context@entry=0x563c304120f0) at ../glib/gmain.c:4284
#18 0x00007f551b1170c9 in g_main_context_dispatch (context=0x563c304120f0) at ../glib/gmain.c:4272
#19 0x0000563c2dc1fa0b in glib_pollfds_poll () at ../util/main-loop.c:290
#20 0x0000563c2dc1fa85 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:313
#21 0x0000563c2dc1fb8a in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
#22 0x0000563c2d78eb60 in qemu_main_loop () at ../system/runstate.c:903
#23 0x0000563c2db412fc in qemu_default_main (opaque=0x0) at ../system/main.c:50
#24 0x0000563c2db413ab in main (argc=40, argv=0x7ffd0bad94d8) at ../system/main.c:93
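For reference, here is a minimal QMP client sketch that scripts the last
two steps against the old secondary's monitor. The socket path and the
-qmp option used to expose it are assumptions about the setup; the
"resync" job name is the one from qemu-colo.rst:

/*
 * qmp-hang-repro.c - minimal sketch, assuming the old secondary was
 * started with "-qmp unix:/tmp/qmp-old-secondary.sock,server,nowait"
 * (hypothetical path) and the resync job is named "resync".
 * Build: cc -o qmp-hang-repro qmp-hang-repro.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

static void send_cmd(int fd, const char *json)
{
    printf("-> %s\n", json);
    if (write(fd, json, strlen(json)) != (ssize_t)strlen(json)) {
        perror("write");
        exit(1);
    }
}

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    char buf[4096];
    ssize_t n;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    strncpy(addr.sun_path, "/tmp/qmp-old-secondary.sock",
            sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* QMP requires capabilities negotiation before any other command. */
    send_cmd(fd, "{\"execute\": \"qmp_capabilities\"}");
    send_cmd(fd, "{\"execute\": \"stop\"}");
    send_cmd(fd, "{\"execute\": \"block-job-cancel\", "
                 "\"arguments\": {\"device\": \"resync\"}}");

    /*
     * Print the greeting and replies as they arrive. With the hang
     * described above, the reply to "stop" never arrives and the
     * read() below blocks forever.
     */
    while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
        buf[n] = '\0';
        fputs(buf, stdout);
    }
    return 0;
}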
---
2) Race at colo_process_checkpoint
The following pattern seems inherently racy: whether the switch statement
sees the state as FAILOVER_STATUS_COMPLETED varies from run to run,
presumably depending on how far the failover code has progressed when the
COLO thread samples the state:
colo_process_checkpoint()
{
    ...
out:
    ...
    /*
     * There are only two reasons we can get here, some error happened
     * or the user triggered failover.
     */
--> switch (failover_get_state()) {
    case FAILOVER_STATUS_COMPLETED:
        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
                                  COLO_EXIT_REASON_REQUEST);
        break;
    default:
        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
                                  COLO_EXIT_REASON_ERROR);
    }

    /* Hope this not to be too long to wait here */
--> qemu_event_wait(&s->colo_exit_event);
    ...
}
This results in what seems like a spurious error event even though
failover was requested:
{"timestamp": {"seconds": 1770984655, "microseconds": 216464}, "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "error"}}
I'm not sure whether the intention is to just ignore this, but moving the
qemu_event_wait() above the state check would seem to eliminate the race;
see the sketch below.
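Concretely, something like this (an untested sketch of the suggested
reordering, not a proper patch; it assumes the failover path sets
colo_exit_event only after the failover state has reached its final
value, and the rest of the function stays as-is):

out:
    ...
    /*
     * Wait for failover work (if any) to finish first, so that
     * failover_get_state() below can no longer observe an
     * in-progress failover.
     */
    qemu_event_wait(&s->colo_exit_event);

    /*
     * There are only two reasons we can get here, some error happened
     * or the user triggered failover.
     */
    switch (failover_get_state()) {
    case FAILOVER_STATUS_COMPLETED:
        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
                                  COLO_EXIT_REASON_REQUEST);
        break;
    default:
        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
                                  COLO_EXIT_REASON_ERROR);
    }
    ...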