Am 13.09.2018 um 22:55 hat Max Reitz geschrieben:
> On 13.09.18 14:52, Kevin Wolf wrote:
> > When starting an active commit job, other callbacks can run before
> > mirror_start_job() calls bdrv_ref() where needed and cause the nodes to
> > go away. Add another pair of bdrv_ref/unref() around it to protect
> > against this case.
> > 
> > Signed-off-by: Kevin Wolf <kw...@redhat.com>
> > ---
> >  block/mirror.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> 
> Reviewed-by: Max Reitz <mre...@redhat.com>
> 
> But...  How?

Tried to reproduce the other bug Paolo was concerned about (good there
is something like 'git reflog'!) and dug up this one instead.

So the scenario seems to be test_stream_commit_1 in qemu-iotests 030.

The backtrace when the BDS is deleted is the following:

(rr) bt
#0  0x00007faeaf6145ec in __memset_sse2_unaligned_erms () at /lib64/libc.so.6
#1  0x00007faeaf60414e in _int_free () at /lib64/libc.so.6
#2  0x00007faeaf6093be in free () at /lib64/libc.so.6
#3  0x00007faecaea6b4e in g_free () at /lib64/libglib-2.0.so.0
#4  0x000055c0c94f9d6c in bdrv_delete (bs=0x55c0cbe69240) at block.c:3590
#5  0x000055c0c94fbe82 in bdrv_unref (bs=0x55c0cbe69240) at block.c:4638
#6  0x000055c0c94f6761 in bdrv_root_unref_child (child=0x55c0cbe57af0) at 
block.c:2188
#7  0x000055c0c94fe239 in block_job_remove_all_bdrv (job=0x55c0ccf9e220) at 
blockjob.c:200
#8  0x000055c0c94fdf1f in block_job_free (job=0x55c0ccf9e220) at blockjob.c:94
#9  0x000055c0c94ffe0d in job_unref (job=0x55c0ccf9e220) at job.c:368
#10 0x000055c0c95007cd in job_do_dismiss (job=0x55c0ccf9e220) at job.c:641
#11 0x000055c0c95008d9 in job_conclude (job=0x55c0ccf9e220) at job.c:667
#12 0x000055c0c9500b66 in job_finalize_single (job=0x55c0ccf9e220) at job.c:735
#13 0x000055c0c94ff4fa in job_txn_apply (txn=0x55c0cbe19eb0, fn=0x55c0c9500a62 
<job_finalize_single>, lock=true) at job.c:151
#14 0x000055c0c9500e92 in job_do_finalize (job=0x55c0ccf9e220) at job.c:822
#15 0x000055c0c9501040 in job_completed_txn_success (job=0x55c0ccf9e220) at 
job.c:872
#16 0x000055c0c950115c in job_completed (job=0x55c0ccf9e220, ret=0, error=0x0) 
at job.c:892
#17 0x000055c0c92b572c in stream_complete (job=0x55c0ccf9e220, 
opaque=0x55c0cc050bc0) at block/stream.c:96
#18 0x000055c0c950142b in job_defer_to_main_loop_bh (opaque=0x55c0cbe38a50) at 
job.c:981
#19 0x000055c0c96171fc in aio_bh_call (bh=0x55c0cbe19f20) at util/async.c:90
#20 0x000055c0c9617294 in aio_bh_poll (ctx=0x55c0cbd7b8d0) at util/async.c:118
#21 0x000055c0c961c150 in aio_poll (ctx=0x55c0cbd7b8d0, blocking=true) at 
util/aio-posix.c:690
#22 0x000055c0c957246c in bdrv_flush (bs=0x55c0cbe4b250) at block/io.c:2693
#23 0x000055c0c94f8e46 in bdrv_reopen_prepare (reopen_state=0x55c0cbef4d68, 
queue=0x55c0cbeab9f0, errp=0x7ffd65617050) at block.c:3206
#24 0x000055c0c94f883a in bdrv_reopen_multiple (ctx=0x55c0cbd7b8d0, 
bs_queue=0x55c0cbeab9f0, errp=0x7ffd656170c8) at block.c:3032
#25 0x000055c0c94f8a00 in bdrv_reopen (bs=0x55c0cbe2c250, bdrv_flags=8194, 
errp=0x7ffd65617220) at block.c:3075
#26 0x000055c0c956a23f in commit_active_start (job_id=0x0, bs=0x55c0cbd905b0, 
base=0x55c0cbe2c250, creation_flags=0, speed=0, 
on_error=BLOCKDEV_ON_ERROR_REPORT, filter_node_name=0x0, cb=0x0, opaque=0x0, 
auto_complete=false, errp=0x7ffd65617220) at block/mirror.c:1687
#27 0x000055c0c927437c in qmp_block_commit (has_job_id=false, job_id=0x0, 
device=0x55c0cbef4d20 "drive0", has_base_node=false, base_node=0x0, 
has_base=true, base=0x55c0cbe38a00 
"/home/kwolf/source/qemu/tests/qemu-iotests/scratch/img-3.img", 
has_top_node=false, top_node=0x0, has_top=false, top=0x0, 
has_backing_file=false, backing_file=0x0, has_speed=false, speed=0, 
has_filter_node_name=false, filter_node_name=0x0, errp=0x7ffd656172d8) at 
blockdev.c:3325
#28 0x000055c0c928bb08 in qmp_marshal_block_commit (args=<optimized out>, 
ret=<optimized out>, errp=0x7ffd656173b8) at qapi/qapi-commands-block-core.c:409
#29 0x000055c0c960beef in do_qmp_dispatch (errp=0x7ffd656173b0, 
allow_oob=false, request=<optimized out>, cmds=0x55c0c9f56f80 <qmp_commands>) 
at qapi/qmp-dispatch.c:129
#30 0x000055c0c960beef in qmp_dispatch (cmds=0x55c0c9f56f80 <qmp_commands>, 
request=<optimized out>, allow_oob=<optimized out>) at qapi/qmp-dispatch.c:171
#31 0x000055c0c9111623 in monitor_qmp_dispatch (mon=0x55c0cbdbb930, 
req=0x55c0ccf90e00, id=0x0) at /home/kwolf/source/qemu/monitor.c:4168
#32 0x000055c0c911191f in monitor_qmp_bh_dispatcher (data=0x0) at 
/home/kwolf/source/qemu/monitor.c:4237
#33 0x000055c0c96171fc in aio_bh_call (bh=0x55c0cbccc840) at util/async.c:90
#34 0x000055c0c9617294 in aio_bh_poll (ctx=0x55c0cbccc550) at util/async.c:118
#35 0x000055c0c961b723 in aio_dispatch (ctx=0x55c0cbccc550) at 
util/aio-posix.c:436
#36 0x000055c0c961762f in aio_ctx_dispatch (source=0x55c0cbccc550, 
callback=0x0, user_data=0x0) at util/async.c:261
#37 0x00007faecaea1257 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#38 0x000055c0c961a2aa in glib_pollfds_poll () at util/main-loop.c:215
#39 0x000055c0c961a2aa in os_host_main_loop_wait (timeout=0) at 
util/main-loop.c:238
#40 0x000055c0c961a2aa in main_loop_wait (nonblocking=<optimized out>) at 
util/main-loop.c:497
#41 0x000055c0c92805e1 in main_loop () at vl.c:1866
#42 0x000055c0c9287eca in main (argc=18, argv=0x7ffd65617958, 
envp=0x7ffd656179f0) at vl.c:4647
(rr) p bs.node_name 
$17 = "node1", '\000' <repeats 26 times>

Obviously, this exact backtrace shouldn't even be possible like this
because job_defer_to_main_loop_bh() shouldn't be called inside
bdrv_flush(), but when draining the subtree in bdrv_reopen(). This is
only fixed in "blockjob: Lie better in child_job_drained_poll()".

It probably doesn't make a difference, though, where exactly during
bdrv_reopen() the node is deleted. We'll crash anyway if we don't hold
the extra reference.

Kevin

Attachment: signature.asc
Description: PGP signature

Reply via email to