25.03.2020 18:50, Stefan Reiter wrote:
backup_clean is only ever called as a handler via job_exit, which
Hmm.. I'm afraid it's not quite correct.
job_clean
job_finalize_single
job_completed_txn_abort (lock aio context)
job_do_finalize
Hmm. job_do_finalize calls job_completed_txn_abort, which cares to lock aio
context..
And on the same time, it directaly calls job_txn_apply(job->txn,
job_finalize_single)
without locking. Is it a bug?
And, even if job_do_finalize called always with locked context, where is
guarantee that all
context of all jobs in txn are locked?
Still, let's look through its callers.
job_finalize
qmp_block_job_finalize (lock aio context)
qmp_job_finalize (lock aio context)
test_cancel_concluded (doesn't lock, but it's a test)
job_completed_txn_success
job_completed
job_exit (lock aio context)
job_cancel
blockdev_mark_auto_del (lock aio context)
job_user_cancel
qmp_block_job_cancel (locks context)
qmp_job_cancel (locks context)
job_cancel_err
job_cancel_sync (return job_finish_sync(job,
&job_cancel_err, NULL);, job_finish_sync just calls callback)
replication_close (it's .bdrv_close.. Hmm, I
don't see context locking, where is it ?)
replication_stop (locks context)
drive_backup_abort (locks context)
blockdev_backup_abort (locks context)
job_cancel_sync_all (locks context)
cancel_common (locks context)
test_* (I don't care)
already acquires the job's context. The job's context is guaranteed to
be the same as the one used by backup_top via backup_job_create.
Since the previous logic effectively acquired the lock twice, this
broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE
in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
once, thus deadlocking with the IO thread.
Signed-off-by: Stefan Reiter <s.rei...@proxmox.com>
Just note, that this thing were recently touched by 0abf2581717a19 , so add
Sergio (its author) to CC.
---
This is a fix for the issue discussed in this part of the thread:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
...not the original problem (core dump) posted by Dietmar.
I've still seen it occasionally hang during a backup abort. I'm trying to figure
out why that happens, stack trace indicates a similar problem with the main
thread hanging at bdrv_do_drained_begin, though I have no clue why as of yet.
block/backup.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
static void backup_clean(Job *job)
{
BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
- AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
-
- aio_context_acquire(aio_context);
bdrv_backup_top_drop(s->backup_top);
- aio_context_release(aio_context);
}
void backup_do_checkpoint(BlockJob *job, Error **errp)
--
Best regards,
Vladimir