From: Michael Qiu <qiud...@huayun.com>

Currently, if the guest has workloads, the IO thread will acquire the
aio_context lock before doing io_submit, which leads to a segmentation
fault when doing a block commit after a snapshot. For example:

[Switching to thread 2 (Thread 0x7f046c312700 (LWP 108791))]
#0  0x00005573f57930db in bdrv_mirror_top_pwritev ... at block/mirror.c:1420
1420    in block/mirror.c
(gdb) p s->job
$17 = (MirrorBlockJob *) 0x0
(gdb) p s->stop
$18 = false
(gdb) bt
#0  0x00005573f57930db in bdrv_mirror_top_pwritev ... at block/mirror.c:1420
#1  0x00005573f5798ceb in bdrv_driver_pwritev ... at block/io.c:1183
#2  0x00005573f579ae7a in bdrv_aligned_pwritev ... at block/io.c:1980
#3  0x00005573f579b667 in bdrv_co_pwritev_part ... at block/io.c:2137
#4  0x00005573f57886c8 in blk_do_pwritev_part ... at block/block-backend.c:1231
#5  0x00005573f578879d in blk_aio_write_entry ... at block/block-backend.c:1439
#6  0x00005573f58317cb in coroutine_trampoline ... at util/coroutine-ucontext.c:115
#7  0x00007f047414a0d0 in __start_context () at /lib64/libc.so.6
#8  0x00007f046c310e60 in ()
#9  0x0000000000000000 in ()
Switching to the QMP (main) thread:

#0  0x00007f04744dd4ed in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007f04744d8de6 in _L_lock_941 () at /lib64/libpthread.so.0
#2  0x00007f04744d8cdf in pthread_mutex_lock () at /lib64/libpthread.so.0
#3  0x00005573f581de89 in qemu_mutex_lock_impl ... at util/qemu-thread-posix.c:78
#4  0x00005573f575789e in block_job_add_bdrv ... at blockjob.c:223
#5  0x00005573f5757ebd in block_job_create ... at blockjob.c:441
#6  0x00005573f5792430 in mirror_start_job ... at block/mirror.c:1604
#7  0x00005573f5794b6f in commit_active_start ... at block/mirror.c:1789

In the IO thread, when bdrv_mirror_top_pwritev() runs, s->job is NULL and
the stop field is false. This means the s object has not been fully
initialized yet: it is initialized by block_job_create(), but that
initialization is stuck waiting to acquire the lock.

The root cause is that QEMU releases and re-acquires the aio_context lock
in block_job_add_bdrv() while the job is still being set up; the IO thread
grabs the lock during the release window, and the crash occurs.

In this situation, job->job.aio_context does not equal
qemu_get_aio_context() and is the same as bs->aio_context, so there is no
need to release the lock, because bdrv_root_attach_child() will not change
the context.

This patch fixes this issue.
Signed-off-by: Michael Qiu <qiud...@huayun.com>
---
 blockjob.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index c6e20e2f..e1d41db9 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -214,12 +214,14 @@ int block_job_add_bdrv(BlockJob *job, const char *name, BlockDriverState *bs,
     BdrvChild *c;
 
     bdrv_ref(bs);
-    if (job->job.aio_context != qemu_get_aio_context()) {
+    if (bdrv_get_aio_context(bs) != job->job.aio_context &&
+        job->job.aio_context != qemu_get_aio_context()) {
         aio_context_release(job->job.aio_context);
     }
     c = bdrv_root_attach_child(bs, name, &child_job, job->job.aio_context,
                                perm, shared_perm, job, errp);
-    if (job->job.aio_context != qemu_get_aio_context()) {
+    if (bdrv_get_aio_context(bs) != job->job.aio_context &&
+        job->job.aio_context != qemu_get_aio_context()) {
         aio_context_acquire(job->job.aio_context);
     }
     if (c == NULL) {
-- 
2.22.0