Hi all! As Peter recently noted, iotest 30 accidentally fails.
I found that Qemu crashes due to interleaving of graph-update operations of parallel mirror and stream block-jobs. So, here is a "workaround" to discuss. It's of course not the full solution, as if we decide to go this way we should protect by the mutex all graph-modifying operations, not only here. And move everything into coroutine.. So, I send this mostly as a starting point for discussion, may be someone imagine better solution. Main patches are 04-05. 01-02 only simplify debugging and 03 is preparation for 04. Original qemu crash looks like this: #0 0x00007f7029b23e35 in raise () at /lib64/libc.so.6 #1 0x00007f7029b0e895 in abort () at /lib64/libc.so.6 #2 0x00007f7029b0e769 in _nl_load_domain.cold () at /lib64/libc.so.6 #3 0x00007f7029b1c566 in annobin_assert.c_end () at /lib64/libc.so.6 #4 0x0000558f3d92f15a in bdrv_replace_child (child=0x558f3fa7c400, new_bs=0x0) at ../block.c:2648 #5 0x0000558f3d92f6e1 in bdrv_detach_child (child=0x558f3fa7c400) at ../block.c:2777 #6 0x0000558f3d92f723 in bdrv_root_unref_child (child=0x558f3fa7c400) at ../block.c:2789 #7 0x0000558f3d897f4c in block_job_remove_all_bdrv (job=0x558f3f626940) at ../blockjob.c:191 #8 0x0000558f3d897c73 in block_job_free (job=0x558f3f626940) at ../blockjob.c:88 #9 0x0000558f3d891456 in job_unref (job=0x558f3f626940) at ../job.c:380 #10 0x0000558f3d892602 in job_exit (opaque=0x558f3f626940) at ../job.c:894 #11 0x0000558f3d9ce2fb in aio_bh_call (bh=0x558f3f5dc480) at ../util/async.c:136 #12 0x0000558f3d9ce405 in aio_bh_poll (ctx=0x558f3e80c5f0) at ../util/async.c:164 #13 0x0000558f3d9f75ea in aio_dispatch (ctx=0x558f3e80c5f0) at ../util/aio-posix.c:381 #14 0x0000558f3d9ce836 in aio_ctx_dispatch (source=0x558f3e80c5f0, callback=0x0, user_data=0x0) at ../util/async.c:306 #15 0x00007f702ae75ecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0 #16 0x0000558f3da09e33 in glib_pollfds_poll () at ../util/main-loop.c:221 #17 0x0000558f3da09ead in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244 #18 0x0000558f3da09fb5 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520 #19 0x0000558f3d7836b7 in qemu_main_loop () at ../softmmu/vl.c:1678 #20 0x0000558f3d317316 in main (argc=20, argv=0x7fffa94d35a8, envp=0x7fffa94d3650) at ../softmmu/main.c:50 (gdb) fr 4 #4 0x0000558f3d92f15a in bdrv_replace_child (child=0x558f3fa7c400, new_bs=0x0) at ../block.c:2648 2648 assert(tighten_restrictions == false); (gdb) list 2643 int ret; 2644 2645 bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm); 2646 ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL, 2647 &tighten_restrictions, NULL); 2648 assert(tighten_restrictions == false); 2649 if (ret < 0) { 2650 /* We only tried to loosen restrictions, so errors are not fatal */ 2651 bdrv_abort_perm_update(old_bs); 2652 } else { And my exploration shows that this due to permission-graph already broken before this permission update. So we tighten restrictions not because removing the child but because we recalculate broken permissions graph and it becomes correct (and more strict unfortunately). Also, please look through my explorations on this topic in threads: "iotest 030 still occasionally intermittently failing" https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg04018.html "question about bdrv_replace_node" https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg04478.html Vladimir Sementsov-Ogievskiy (5): abort-on-set-to-true iotest-30-shorten: concentrate on failing test case scripts/block-coroutine-wrapper.py: allow more function types block: move some mirror and stream handlers to coroutine block: protect some graph-modifyng things by mutex block/coroutines.h | 11 +++++++ include/block/block.h | 2 ++ block.c | 36 +++++++++++++++------ block/mirror.c | 9 ++++-- block/stream.c | 9 ++++-- scripts/block-coroutine-wrapper.py | 36 +++++++++++++-------- tests/qemu-iotests/030 | 52 +++++++++++++++--------------- tests/qemu-iotests/030.out | 4 +-- 8 files changed, 105 insertions(+), 54 deletions(-) -- 2.21.3