On Mon, Mar 18, 2024 at 12:37:19PM +0100, Kevin Wolf wrote: > Am 14.03.2024 um 15:29 hat Stefan Hajnoczi geschrieben: > > On Wed, Mar 13, 2024 at 04:30:00PM +0100, Kevin Wolf wrote: > > > Calling job_pause_point() while holding the graph reader lock > > > potentially results in a deadlock: bdrv_graph_wrlock() first drains > > > everything, including the mirror job, which pauses it. The job is only > > > unpaused at the end of the drain section, which is when the graph writer > > > lock has been successfully taken. However, if the job happens to be > > > paused at a pause point where it still holds the reader lock, the writer > > > lock can't be taken as long as the job is still paused. > > > > > > Mark job_pause_point() as GRAPH_UNLOCKED and fix mirror accordingly. > > > > > > Cc: qemu-sta...@nongnu.org > > > Buglink: https://issues.redhat.com/browse/RHEL-28125 > > > Fixes: 004915a96a7a40e942ac85e6d22518cbcd283506 > > > Signed-off-by: Kevin Wolf <kw...@redhat.com> > > > --- > > > include/qemu/job.h | 2 +- > > > block/mirror.c | 10 ++++++---- > > > 2 files changed, 7 insertions(+), 5 deletions(-) > > > > > > diff --git a/include/qemu/job.h b/include/qemu/job.h > > > index 9ea98b5927..2b873f2576 100644 > > > --- a/include/qemu/job.h > > > +++ b/include/qemu/job.h > > > @@ -483,7 +483,7 @@ void job_enter(Job *job); > > > * > > > * Called with job_mutex *not* held. > > > */ > > > -void coroutine_fn job_pause_point(Job *job); > > > +void coroutine_fn GRAPH_UNLOCKED job_pause_point(Job *job); > > > > > > /** > > > * @job: The job that calls the function. > > > diff --git a/block/mirror.c b/block/mirror.c > > > index 5145eb53e1..1bdce3b657 100644 > > > --- a/block/mirror.c > > > +++ b/block/mirror.c > > > @@ -479,9 +479,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, > > > int64_t offset, > > > return bytes_handled; > > > } > > > > > > -static void coroutine_fn GRAPH_RDLOCK mirror_iteration(MirrorBlockJob *s) > > > +static void coroutine_fn GRAPH_UNLOCKED mirror_iteration(MirrorBlockJob > > > *s) > > > { > > > - BlockDriverState *source = s->mirror_top_bs->backing->bs; > > > + BlockDriverState *source; > > > MirrorOp *pseudo_op; > > > int64_t offset; > > > /* At least the first dirty chunk is mirrored in one iteration. */ > > > @@ -489,6 +489,10 @@ static void coroutine_fn GRAPH_RDLOCK > > > mirror_iteration(MirrorBlockJob *s) > > > bool write_zeroes_ok = > > > bdrv_can_write_zeroes_with_unmap(blk_bs(s->target)); > > > int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES); > > > > > > + bdrv_graph_co_rdlock(); > > > + source = s->mirror_top_bs->backing->bs; > > > > Is bdrv_ref(source) needed here so that source cannot go away if someone > > else write locks the graph and removes it? Or maybe something else > > protects against that. Either way, please add a comment that explains > > why this is safe. > > We didn't even get to looking at this level of detail with the graph > locking work. We probably should, but this is not the only place in > mirror we need to look at then. Commit 004915a9 just took the lazy path > of taking the lock for the whole function, and it turns out that this > was wrong and causes deadlocks, so I'm reverting it and replacing it > with what other parts of the code do - the minimal thing to let it > compile. > > I think we already own a reference, we do a block_job_add_bdrv() in > mirror_start_job(). But once it changes, we have a reference to the > wrong node. So it looks to me that mirror has a problem with a changing > source node that is more fundamental than graph locking in one specific > function because it stores BDS pointers in its state. > > Active commit already freezes the backing chain between mirror_top_bs > and target, maybe other mirror jobs need to freeze the link between > mirror_top_bs and source at least. > > So I agree that it might be worth looking into this more, but I consider > it unrelated to this patch. We just go back to the state in which it has > always been before 8.2 (which might contain a latent bug that apparently > never triggered in practice) to fix a regression that we do see in > practice. > > Kevin
Okay: Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
signature.asc
Description: PGP signature