Hi all!

As Peter recently noted, iotest 030 occasionally fails.

I found that QEMU crashes because graph-update operations of parallel
mirror and stream block jobs interleave.

So, here is a "workaround" to discuss.

It is of course not a full solution: if we decide to go this way, we should
protect all graph-modifying operations with the mutex, not only the ones
touched here, and move everything into coroutines.
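
To make the idea concrete, here is a toy model of the serialization (not QEMU
code; ToyGraph and the toy_* helpers are hypothetical names, and a plain flag
stands in for the mutex). An update is split into two steps with a possible
yield in between; a second update that tries to start in that window is
refused, where a real coroutine would wait on the mutex instead:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a block graph (hypothetical, not a QEMU structure).
 * 'children' is the real edge count; 'perm_children' is the edge count
 * the cached permissions were last computed for. */
typedef struct ToyGraph {
    int children;
    int perm_children;
    bool locked;        /* plays the role of a graph-wide mutex */
} ToyGraph;

/* The cached permissions are valid only if they match the real edges. */
static bool toy_graph_consistent(const ToyGraph *g)
{
    return g->children == g->perm_children;
}

/* Step 1 of an update: take the lock and change the edges.
 * Returns false if another update is in flight; a real coroutine
 * would wait here instead of failing. */
static bool toy_update_begin(ToyGraph *g, int delta)
{
    if (g->locked) {
        return false;
    }
    g->locked = true;
    g->children += delta;
    return true;
}

/* Step 2, possibly after a yield: refresh permissions and unlock. */
static void toy_update_finish(ToyGraph *g)
{
    assert(g->locked);
    g->perm_children = g->children;
    g->locked = false;
}
```

Without the 'locked' check, the second update would start from a graph whose
cached permissions do not match its edges, which is the kind of interleaving
described above.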

So I am sending this mostly as a starting point for discussion; maybe someone
can think of a better solution.

The main patches are 04-05. Patches 01-02 only simplify debugging, and 03 is
preparation for 04.

The original QEMU crash looks like this:

#0  0x00007f7029b23e35 in raise () at /lib64/libc.so.6
#1  0x00007f7029b0e895 in abort () at /lib64/libc.so.6
#2  0x00007f7029b0e769 in _nl_load_domain.cold () at /lib64/libc.so.6
#3  0x00007f7029b1c566 in annobin_assert.c_end () at /lib64/libc.so.6
#4  0x0000558f3d92f15a in bdrv_replace_child (child=0x558f3fa7c400, new_bs=0x0) at ../block.c:2648
#5  0x0000558f3d92f6e1 in bdrv_detach_child (child=0x558f3fa7c400) at ../block.c:2777
#6  0x0000558f3d92f723 in bdrv_root_unref_child (child=0x558f3fa7c400) at ../block.c:2789
#7  0x0000558f3d897f4c in block_job_remove_all_bdrv (job=0x558f3f626940) at ../blockjob.c:191
#8  0x0000558f3d897c73 in block_job_free (job=0x558f3f626940) at ../blockjob.c:88
#9  0x0000558f3d891456 in job_unref (job=0x558f3f626940) at ../job.c:380
#10 0x0000558f3d892602 in job_exit (opaque=0x558f3f626940) at ../job.c:894
#11 0x0000558f3d9ce2fb in aio_bh_call (bh=0x558f3f5dc480) at ../util/async.c:136
#12 0x0000558f3d9ce405 in aio_bh_poll (ctx=0x558f3e80c5f0) at ../util/async.c:164
#13 0x0000558f3d9f75ea in aio_dispatch (ctx=0x558f3e80c5f0) at ../util/aio-posix.c:381
#14 0x0000558f3d9ce836 in aio_ctx_dispatch (source=0x558f3e80c5f0, callback=0x0, user_data=0x0)
    at ../util/async.c:306
#15 0x00007f702ae75ecd in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#16 0x0000558f3da09e33 in glib_pollfds_poll () at ../util/main-loop.c:221
#17 0x0000558f3da09ead in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:244
#18 0x0000558f3da09fb5 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:520
#19 0x0000558f3d7836b7 in qemu_main_loop () at ../softmmu/vl.c:1678
#20 0x0000558f3d317316 in main (argc=20, argv=0x7fffa94d35a8, envp=0x7fffa94d3650)
    at ../softmmu/main.c:50
(gdb) fr 4
#4  0x0000558f3d92f15a in bdrv_replace_child (child=0x558f3fa7c400, new_bs=0x0) at ../block.c:2648
2648            assert(tighten_restrictions == false);
(gdb) list
2643            int ret;
2644
2645            bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
2646            ret = bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL,
2647                                  &tighten_restrictions, NULL);
2648            assert(tighten_restrictions == false);
2649            if (ret < 0) {
2650                /* We only tried to loosen restrictions, so errors are not fatal */
2651                bdrv_abort_perm_update(old_bs);
2652            } else {


My investigation shows that this happens because the permission graph is
already broken before this permission update. So we tighten restrictions not
because we remove the child, but because we recalculate a broken permission
graph and it becomes correct (and, unfortunately, stricter).
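
A toy model of that effect (not QEMU code; toy_refresh_perm and the bit values
are hypothetical, and how exactly the cached value goes stale is an assumption
made for illustration): the required permissions of a node are taken as the OR
of what its parents need, and a refresh reports "tighten" if the result
contains bits that the cached value lacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Recompute the cumulative permissions of a node from its parents and
 * store them in *cached. Returns true if the new value needs bits that
 * the old cached value did not have, i.e. restrictions were tightened. */
static bool toy_refresh_perm(uint32_t *cached,
                             const uint32_t *parent_perms, size_t n)
{
    uint32_t perm = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        perm |= parent_perms[i];
    }

    bool tighten = (perm & ~*cached) != 0;
    *cached = perm;
    return tighten;
}
```

With a correct cache, dropping a parent can only clear bits, so the function
returns false. If a parent needing an extra bit was attached earlier without a
refresh, the refresh that follows an unrelated removal picks up that bit and
returns true, which corresponds to the failing assertion above.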


Also, please see my investigation of this topic in these threads:

"iotest 030 still occasionally intermittently failing"
https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg04018.html

"question about bdrv_replace_node"
https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg04478.html

Vladimir Sementsov-Ogievskiy (5):
  abort-on-set-to-true
  iotest-30-shorten: concentrate on failing test case
  scripts/block-coroutine-wrapper.py: allow more function types
  block: move some mirror and stream handlers to coroutine
  block: protect some graph-modifying things by mutex

 block/coroutines.h                 | 11 +++++++
 include/block/block.h              |  2 ++
 block.c                            | 36 +++++++++++++++------
 block/mirror.c                     |  9 ++++--
 block/stream.c                     |  9 ++++--
 scripts/block-coroutine-wrapper.py | 36 +++++++++++++--------
 tests/qemu-iotests/030             | 52 +++++++++++++++---------------
 tests/qemu-iotests/030.out         |  4 +--
 8 files changed, 105 insertions(+), 54 deletions(-)

-- 
2.21.3

