Re: [PATCH 05/21] block: Introduce bdrv_schedule_unref()

2023-09-06 Thread Kevin Wolf
Am 05.09.2023 um 18:39 hat Kevin Wolf geschrieben:
> Am 22.08.2023 um 21:01 hat Stefan Hajnoczi geschrieben:
> > On Thu, Aug 17, 2023 at 02:50:04PM +0200, Kevin Wolf wrote:
> > > bdrv_unref() is called by a lot of places that need to hold the graph
> > > lock (it naturally happens in the context of operations that change the
> > > graph). However, bdrv_unref() takes the graph writer lock internally, so
> > > it can't actually be called while already holding a graph lock without
> > > causing a deadlock.
> > > 
> > > bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
> > > node before closing it, and draining requires that the graph is
> > > unlocked.
> > > 
> > > The solution is to defer deleting the node until we don't hold the lock
> > > any more and draining is possible again.
> > > 
> > > Note that keeping images open for longer than necessary can create
> > > problems, too: You can't open an image again before it is really closed
> > > (if image locking didn't prevent it, it would cause corruption).
> > > Reopening an image immediately happens at least during bdrv_open() and
> > > bdrv_co_create().
> > > 
> > > In order to solve this problem, make sure to run the deferred unref in
> > > bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
> > > again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
> > > 
> > > The output of iotest 051 is updated because the additional polling
> > > changes the order of HMP output, resulting in a new "(qemu)" prompt in
> > > the test output that was previously on a separate line and filtered out.
> > > 
> > > Signed-off-by: Kevin Wolf 
> > > ---
> > >  include/block/block-global-state.h |  1 +
> > >  block.c                            |  9 +++++++++
> > >  block/graph-lock.c                 | 23 ++++++++++++++++-------
> > >  tests/qemu-iotests/051.pc.out      |  6 +++---
> > >  4 files changed, 29 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
> > > index f347199bff..e570799f85 100644
> > > --- a/include/block/block-global-state.h
> > > +++ b/include/block/block-global-state.h
> > > @@ -224,6 +224,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
> > >  void bdrv_ref(BlockDriverState *bs);
> > >  void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
> > >  void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
> > > +void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
> > >  void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
> > >  BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
> > >   BlockDriverState *child_bs,
> > > diff --git a/block.c b/block.c
> > > index 6376452768..9c4f24f4b9 100644
> > > --- a/block.c
> > > +++ b/block.c
> > > @@ -7033,6 +7033,15 @@ void bdrv_unref(BlockDriverState *bs)
> > >  }
> > >  }
> > >  
> > > +void bdrv_schedule_unref(BlockDriverState *bs)
> > 
> > Please add a doc comment explaining when and why this should be used.
> 
> Ok.
> 
> > > +{
> > > +if (!bs) {
> > > +return;
> > > +}
> > > +aio_bh_schedule_oneshot(qemu_get_aio_context(),
> > > +(QEMUBHFunc *) bdrv_unref, bs);
> > > +}
> > > +
> > >  struct BdrvOpBlocker {
> > >  Error *reason;
> > >  QLIST_ENTRY(BdrvOpBlocker) list;
> > > diff --git a/block/graph-lock.c b/block/graph-lock.c
> > > index 5e66f01ae8..395d387651 100644
> > > --- a/block/graph-lock.c
> > > +++ b/block/graph-lock.c
> > > @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
> > >  void bdrv_graph_wrunlock(void)
> > >  {
> > >  GLOBAL_STATE_CODE();
> > > -QEMU_LOCK_GUARD(&aio_context_list_lock);
> > >  assert(qatomic_read(&has_writer));
> > >  
> > > +WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> > > +/*
> > > + * No need for memory barriers, this works in pair with
> > > + * the slow path of rdlock() and both take the lock.
> > > + */
> > > +qatomic_store_release(&has_writer, 0);
> > > +
> > > +/* Wake up all coroutine that are waiting to read the graph */
> > 
> > s/coroutine/coroutines/
> 
> I only changed the indentation, but I guess I can just fix it while I
> touch it.
> 
> > > +qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > > +}
> > > +
> > >  /*
> > > - * No need for memory barriers, this works in pair with
> > > - * the slow path of rdlock() and both take the lock.
> > > + * Run any BHs that were scheduled during the wrlock section and that
> > > + * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> > 
> > Referring directly to bdrv_schedule_unref() would help make it clearer
> > what you mean.
> > 
> > > + * only after restarting coroutines so that nested event loops in BHs don't
> > > + * deadlock if their condition relies on the coroutine making progress.

Re: [PATCH 05/21] block: Introduce bdrv_schedule_unref()

2023-09-05 Thread Kevin Wolf
Am 22.08.2023 um 21:01 hat Stefan Hajnoczi geschrieben:
> On Thu, Aug 17, 2023 at 02:50:04PM +0200, Kevin Wolf wrote:
> > bdrv_unref() is called by a lot of places that need to hold the graph
> > lock (it naturally happens in the context of operations that change the
> > graph). However, bdrv_unref() takes the graph writer lock internally, so
> > it can't actually be called while already holding a graph lock without
> > causing a deadlock.
> > 
> > bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
> > node before closing it, and draining requires that the graph is
> > unlocked.
> > 
> > The solution is to defer deleting the node until we don't hold the lock
> > any more and draining is possible again.
> > 
> > Note that keeping images open for longer than necessary can create
> > problems, too: You can't open an image again before it is really closed
> > (if image locking didn't prevent it, it would cause corruption).
> > Reopening an image immediately happens at least during bdrv_open() and
> > bdrv_co_create().
> > 
> > In order to solve this problem, make sure to run the deferred unref in
> > bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
> > again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
> > 
> > The output of iotest 051 is updated because the additional polling
> > changes the order of HMP output, resulting in a new "(qemu)" prompt in
> > the test output that was previously on a separate line and filtered out.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  include/block/block-global-state.h |  1 +
> >  block.c                            |  9 +++++++++
> >  block/graph-lock.c                 | 23 ++++++++++++++++-------
> >  tests/qemu-iotests/051.pc.out      |  6 +++---
> >  4 files changed, 29 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
> > index f347199bff..e570799f85 100644
> > --- a/include/block/block-global-state.h
> > +++ b/include/block/block-global-state.h
> > @@ -224,6 +224,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
> >  void bdrv_ref(BlockDriverState *bs);
> >  void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
> >  void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
> > +void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
> >  void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
> >  BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
> >   BlockDriverState *child_bs,
> > diff --git a/block.c b/block.c
> > index 6376452768..9c4f24f4b9 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -7033,6 +7033,15 @@ void bdrv_unref(BlockDriverState *bs)
> >  }
> >  }
> >  
> > +void bdrv_schedule_unref(BlockDriverState *bs)
> 
> Please add a doc comment explaining when and why this should be used.

Ok.

> > +{
> > +if (!bs) {
> > +return;
> > +}
> > +aio_bh_schedule_oneshot(qemu_get_aio_context(),
> > +(QEMUBHFunc *) bdrv_unref, bs);
> > +}
> > +
> >  struct BdrvOpBlocker {
> >  Error *reason;
> >  QLIST_ENTRY(BdrvOpBlocker) list;
> > diff --git a/block/graph-lock.c b/block/graph-lock.c
> > index 5e66f01ae8..395d387651 100644
> > --- a/block/graph-lock.c
> > +++ b/block/graph-lock.c
> > @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
> >  void bdrv_graph_wrunlock(void)
> >  {
> >  GLOBAL_STATE_CODE();
> > -QEMU_LOCK_GUARD(&aio_context_list_lock);
> >  assert(qatomic_read(&has_writer));
> >  
> > +WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> > +/*
> > + * No need for memory barriers, this works in pair with
> > + * the slow path of rdlock() and both take the lock.
> > + */
> > +qatomic_store_release(&has_writer, 0);
> > +
> > +/* Wake up all coroutine that are waiting to read the graph */
> 
> s/coroutine/coroutines/

I only changed the indentation, but I guess I can just fix it while I
touch it.

> > +qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > +}
> > +
> >  /*
> > - * No need for memory barriers, this works in pair with
> > - * the slow path of rdlock() and both take the lock.
> > + * Run any BHs that were scheduled during the wrlock section and that
> > + * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> 
> Referring directly to bdrv_schedule_unref() would help make it clearer
> what you mean.
> 
> > + * only after restarting coroutines so that nested event loops in BHs don't
> > + * deadlock if their condition relies on the coroutine making progress.
> >   */
> > -qatomic_store_release(&has_writer, 0);
> > -
> > -/* Wake up all coroutine that are waiting to read the graph */
> > -qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > +aio_bh_poll(qemu_get_aio_context());
> 
> Keeping a dedicated list of BDSes pending unref seems cleaner than using
> the aio_bh_poll(). There's a lot of stuff happening in the event loop
> and I wonder if those other BHs might cause issues if they run at the
> end of bdrv_graph_wrunlock(). The change to qemu-iotests 051's output is
> an example of this, but other things could happen too (e.g. monitor
> command processing).

Re: [PATCH 05/21] block: Introduce bdrv_schedule_unref()

2023-08-22 Thread Stefan Hajnoczi
On Thu, Aug 17, 2023 at 02:50:04PM +0200, Kevin Wolf wrote:
> bdrv_unref() is called by a lot of places that need to hold the graph
> lock (it naturally happens in the context of operations that change the
> graph). However, bdrv_unref() takes the graph writer lock internally, so
> it can't actually be called while already holding a graph lock without
> causing a deadlock.
> 
> bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
> node before closing it, and draining requires that the graph is
> unlocked.
> 
> The solution is to defer deleting the node until we don't hold the lock
> any more and draining is possible again.
> 
> Note that keeping images open for longer than necessary can create
> problems, too: You can't open an image again before it is really closed
> (if image locking didn't prevent it, it would cause corruption).
> Reopening an image immediately happens at least during bdrv_open() and
> bdrv_co_create().
> 
> In order to solve this problem, make sure to run the deferred unref in
> bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
> again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
> 
> The output of iotest 051 is updated because the additional polling
> changes the order of HMP output, resulting in a new "(qemu)" prompt in
> the test output that was previously on a separate line and filtered out.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  include/block/block-global-state.h |  1 +
>  block.c                            |  9 +++++++++
>  block/graph-lock.c                 | 23 ++++++++++++++++-------
>  tests/qemu-iotests/051.pc.out      |  6 +++---
>  4 files changed, 29 insertions(+), 10 deletions(-)
> 
> diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
> index f347199bff..e570799f85 100644
> --- a/include/block/block-global-state.h
> +++ b/include/block/block-global-state.h
> @@ -224,6 +224,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
>  void bdrv_ref(BlockDriverState *bs);
>  void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
>  void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
> +void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
>  void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
>  BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>   BlockDriverState *child_bs,
> diff --git a/block.c b/block.c
> index 6376452768..9c4f24f4b9 100644
> --- a/block.c
> +++ b/block.c
> @@ -7033,6 +7033,15 @@ void bdrv_unref(BlockDriverState *bs)
>  }
>  }
>  
> +void bdrv_schedule_unref(BlockDriverState *bs)

Please add a doc comment explaining when and why this should be used.

> +{
> +if (!bs) {
> +return;
> +}
> +aio_bh_schedule_oneshot(qemu_get_aio_context(),
> +(QEMUBHFunc *) bdrv_unref, bs);
> +}
> +
>  struct BdrvOpBlocker {
>  Error *reason;
>  QLIST_ENTRY(BdrvOpBlocker) list;
> diff --git a/block/graph-lock.c b/block/graph-lock.c
> index 5e66f01ae8..395d387651 100644
> --- a/block/graph-lock.c
> +++ b/block/graph-lock.c
> @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
>  void bdrv_graph_wrunlock(void)
>  {
>  GLOBAL_STATE_CODE();
> -QEMU_LOCK_GUARD(&aio_context_list_lock);
>  assert(qatomic_read(&has_writer));
>  
> +WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> +/*
> + * No need for memory barriers, this works in pair with
> + * the slow path of rdlock() and both take the lock.
> + */
> +qatomic_store_release(&has_writer, 0);
> +
> +/* Wake up all coroutine that are waiting to read the graph */

s/coroutine/coroutines/

> +qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> +}
> +
>  /*
> - * No need for memory barriers, this works in pair with
> - * the slow path of rdlock() and both take the lock.
> + * Run any BHs that were scheduled during the wrlock section and that
> + * callers might expect to have finished (e.g. bdrv_unref() calls). Do this

Referring directly to bdrv_schedule_unref() would help make it clearer
what you mean.

> + * only after restarting coroutines so that nested event loops in BHs don't
> + * deadlock if their condition relies on the coroutine making progress.
>   */
> -qatomic_store_release(&has_writer, 0);
> -
> -/* Wake up all coroutine that are waiting to read the graph */
> -qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> +aio_bh_poll(qemu_get_aio_context());

Keeping a dedicated list of BDSes pending unref seems cleaner than using
the aio_bh_poll(). There's a lot of stuff happening in the event loop
and I wonder if those other BHs might cause issues if they run at the
end of bdrv_graph_wrunlock(). The change to qemu-iotests 051's output is
an example of this, but other things could happen too (e.g. monitor
command processing).

> 

Re: [PATCH 05/21] block: Introduce bdrv_schedule_unref()

2023-08-21 Thread Kevin Wolf
Am 18.08.2023 um 18:26 hat Eric Blake geschrieben:
> On Fri, Aug 18, 2023 at 11:24:00AM -0500, Eric Blake wrote:
> > > +++ b/block/graph-lock.c
> > > @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
> > >  void bdrv_graph_wrunlock(void)
> > >  {
> > >  GLOBAL_STATE_CODE();
> > > -QEMU_LOCK_GUARD(&aio_context_list_lock);
> > >  assert(qatomic_read(&has_writer));
> > >  
> > > +WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> > > +/*
> > > + * No need for memory barriers, this works in pair with
> > > + * the slow path of rdlock() and both take the lock.
> > > + */
> > > +qatomic_store_release(&has_writer, 0);
> > > +
> > > +/* Wake up all coroutine that are waiting to read the graph */
> > > +qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > 
> > So if I understand coroutines correctly, this says all pending
> > coroutines are now scheduled to run (or maybe they do try to run here,
> > but then immediately return control back to this coroutine to await
> > the right lock conditions since we are still in the block guarded by
> > list_lock)...
> > 
> > > +}
> > > +
> > >  /*
> > > - * No need for memory barriers, this works in pair with
> > > - * the slow path of rdlock() and both take the lock.
> > > + * Run any BHs that were scheduled during the wrlock section and that
> > > + * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> > > + * only after restarting coroutines so that nested event loops in BHs don't
> > > + * deadlock if their condition relies on the coroutine making progress.
> > >   */
> > > -qatomic_store_release(&has_writer, 0);
> > > -
> > > -/* Wake up all coroutine that are waiting to read the graph */
> > > -qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > > +aio_bh_poll(qemu_get_aio_context());
> > 
> > ...and as long as the other coroutines sharing this thread don't
> > actually get to make progress until the next point at which the
> > current coroutine yields, and as long as our aio_bh_poll() doesn't
> > also include a yield point, then we are ensured that the BH has
> > completed before the next yield point in our caller.
> > 
> > There are times (like today) where I'm still trying to wrap my mind
> > around the subtle differences between true multi-threading
> > vs. cooperative coroutines sharing a single thread via the use of
> > yield points.  coroutines are cool when they can get rid of some of
> > the races that you have to worry about in true multi-threading.
> 
> That said, once we introduce multi-queue, can we ever have a scenario
> where a different iothread might be trying to access the graph and
> perform a reopen in the time while this thread has not completed the
> BH close?  Or is that protected by some other mutual exclusion (where
> the only one we have to worry about is reopen by a coroutine in the
> same thread, because all other threads are locked out of graph
> modifications)?

We don't have to worry about that one because reopen (and taking the
writer lock in general) only happen in the main thread, which is exactly
the thread that this code runs in.

The thing that we need to take into consideration is that aio_bh_poll()
could call something that wants to take the writer lock and modify the
graph again. It's not really a problem, though, because we're already
done at that point. Any readers that we resumed above will just
synchronise with the writer in the usual way and one of them will have
to wait. But there is nothing that is still waiting to be resumed and
could deadlock.

Kevin




Re: [PATCH 05/21] block: Introduce bdrv_schedule_unref()

2023-08-18 Thread Eric Blake
On Fri, Aug 18, 2023 at 11:24:00AM -0500, Eric Blake wrote:
> > +++ b/block/graph-lock.c
> > @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
> >  void bdrv_graph_wrunlock(void)
> >  {
> >  GLOBAL_STATE_CODE();
> > -QEMU_LOCK_GUARD(&aio_context_list_lock);
> >  assert(qatomic_read(&has_writer));
> >  
> > +WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> > +/*
> > + * No need for memory barriers, this works in pair with
> > + * the slow path of rdlock() and both take the lock.
> > + */
> > +qatomic_store_release(&has_writer, 0);
> > +
> > +/* Wake up all coroutine that are waiting to read the graph */
> > +qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> 
> So if I understand coroutines correctly, this says all pending
> coroutines are now scheduled to run (or maybe they do try to run here,
> but then immediately return control back to this coroutine to await
> the right lock conditions since we are still in the block guarded by
> list_lock)...
> 
> > +}
> > +
> >  /*
> > - * No need for memory barriers, this works in pair with
> > - * the slow path of rdlock() and both take the lock.
> > + * Run any BHs that were scheduled during the wrlock section and that
> > + * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> > + * only after restarting coroutines so that nested event loops in BHs don't
> > + * deadlock if their condition relies on the coroutine making progress.
> >   */
> > -qatomic_store_release(&has_writer, 0);
> > -
> > -/* Wake up all coroutine that are waiting to read the graph */
> > -qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > +aio_bh_poll(qemu_get_aio_context());
> 
> ...and as long as the other coroutines sharing this thread don't
> actually get to make progress until the next point at which the
> current coroutine yields, and as long as our aio_bh_poll() doesn't
> also include a yield point, then we are ensured that the BH has
> completed before the next yield point in our caller.
> 
> There are times (like today) where I'm still trying to wrap my mind
> around the subtle differences between true multi-threading
> vs. cooperative coroutines sharing a single thread via the use of
> yield points.  coroutines are cool when they can get rid of some of
> the races that you have to worry about in true multi-threading.

That said, once we introduce multi-queue, can we ever have a scenario
where a different iothread might be trying to access the graph and
perform a reopen in the time while this thread has not completed the
BH close?  Or is that protected by some other mutual exclusion (where
the only one we have to worry about is reopen by a coroutine in the
same thread, because all other threads are locked out of graph
modifications)?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH 05/21] block: Introduce bdrv_schedule_unref()

2023-08-18 Thread Eric Blake
On Thu, Aug 17, 2023 at 02:50:04PM +0200, Kevin Wolf wrote:
> bdrv_unref() is called by a lot of places that need to hold the graph
> lock (it naturally happens in the context of operations that change the
> graph). However, bdrv_unref() takes the graph writer lock internally, so
> it can't actually be called while already holding a graph lock without
> causing a deadlock.
> 
> bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
> node before closing it, and draining requires that the graph is
> unlocked.
> 
> The solution is to defer deleting the node until we don't hold the lock
> any more and draining is possible again.
> 
> Note that keeping images open for longer than necessary can create
> problems, too: You can't open an image again before it is really closed
> (if image locking didn't prevent it, it would cause corruption).
> Reopening an image immediately happens at least during bdrv_open() and
> bdrv_co_create().
> 
> In order to solve this problem, make sure to run the deferred unref in
> bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
> again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
> 
> The output of iotest 051 is updated because the additional polling
> changes the order of HMP output, resulting in a new "(qemu)" prompt in
> the test output that was previously on a separate line and filtered out.
> 
> Signed-off-by: Kevin Wolf 
> ---

> +++ b/block/graph-lock.c
> @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
>  void bdrv_graph_wrunlock(void)
>  {
>  GLOBAL_STATE_CODE();
> -QEMU_LOCK_GUARD(&aio_context_list_lock);
>  assert(qatomic_read(&has_writer));
>  
> +WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> +/*
> + * No need for memory barriers, this works in pair with
> + * the slow path of rdlock() and both take the lock.
> + */
> +qatomic_store_release(&has_writer, 0);
> +
> +/* Wake up all coroutine that are waiting to read the graph */
> +qemu_co_enter_all(&reader_queue, &aio_context_list_lock);

So if I understand coroutines correctly, this says all pending
coroutines are now scheduled to run (or maybe they do try to run here,
but then immediately return control back to this coroutine to await
the right lock conditions since we are still in the block guarded by
list_lock)...

> +}
> +
>  /*
> - * No need for memory barriers, this works in pair with
> - * the slow path of rdlock() and both take the lock.
> + * Run any BHs that were scheduled during the wrlock section and that
> + * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> + * only after restarting coroutines so that nested event loops in BHs don't
> + * deadlock if their condition relies on the coroutine making progress.
>   */
> -qatomic_store_release(&has_writer, 0);
> -
> -/* Wake up all coroutine that are waiting to read the graph */
> -qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> +aio_bh_poll(qemu_get_aio_context());

...and as long as the other coroutines sharing this thread don't
actually get to make progress until the next point at which the
current coroutine yields, and as long as our aio_bh_poll() doesn't
also include a yield point, then we are ensured that the BH has
completed before the next yield point in our caller.

There are times (like today) where I'm still trying to wrap my mind
around the subtle differences between true multi-threading
vs. cooperative coroutines sharing a single thread via the use of
yield points.  coroutines are cool when they can get rid of some of
the races that you have to worry about in true multi-threading.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




[PATCH 05/21] block: Introduce bdrv_schedule_unref()

2023-08-17 Thread Kevin Wolf
bdrv_unref() is called by a lot of places that need to hold the graph
lock (it naturally happens in the context of operations that change the
graph). However, bdrv_unref() takes the graph writer lock internally, so
it can't actually be called while already holding a graph lock without
causing a deadlock.

bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
node before closing it, and draining requires that the graph is
unlocked.

The solution is to defer deleting the node until we don't hold the lock
any more and draining is possible again.

Note that keeping images open for longer than necessary can create
problems, too: You can't open an image again before it is really closed
(if image locking didn't prevent it, it would cause corruption).
Reopening an image immediately happens at least during bdrv_open() and
bdrv_co_create().

In order to solve this problem, make sure to run the deferred unref in
bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.

The output of iotest 051 is updated because the additional polling
changes the order of HMP output, resulting in a new "(qemu)" prompt in
the test output that was previously on a separate line and filtered out.

Signed-off-by: Kevin Wolf 
---
 include/block/block-global-state.h |  1 +
 block.c                            |  9 +++++++++
 block/graph-lock.c                 | 23 ++++++++++++++++-------
 tests/qemu-iotests/051.pc.out      |  6 +++---
 4 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index f347199bff..e570799f85 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -224,6 +224,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
 void bdrv_ref(BlockDriverState *bs);
 void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
 void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
+void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
 BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
  BlockDriverState *child_bs,
diff --git a/block.c b/block.c
index 6376452768..9c4f24f4b9 100644
--- a/block.c
+++ b/block.c
@@ -7033,6 +7033,15 @@ void bdrv_unref(BlockDriverState *bs)
 }
 }
 
+void bdrv_schedule_unref(BlockDriverState *bs)
+{
+if (!bs) {
+return;
+}
+aio_bh_schedule_oneshot(qemu_get_aio_context(),
+(QEMUBHFunc *) bdrv_unref, bs);
+}
+
 struct BdrvOpBlocker {
 Error *reason;
 QLIST_ENTRY(BdrvOpBlocker) list;
diff --git a/block/graph-lock.c b/block/graph-lock.c
index 5e66f01ae8..395d387651 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
 void bdrv_graph_wrunlock(void)
 {
 GLOBAL_STATE_CODE();
-QEMU_LOCK_GUARD(&aio_context_list_lock);
 assert(qatomic_read(&has_writer));
 
+WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
+/*
+ * No need for memory barriers, this works in pair with
+ * the slow path of rdlock() and both take the lock.
+ */
+qatomic_store_release(&has_writer, 0);
+
+/* Wake up all coroutine that are waiting to read the graph */
+qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
+}
+
 /*
- * No need for memory barriers, this works in pair with
- * the slow path of rdlock() and both take the lock.
+ * Run any BHs that were scheduled during the wrlock section and that
+ * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
+ * only after restarting coroutines so that nested event loops in BHs don't
+ * deadlock if their condition relies on the coroutine making progress.
  */
-qatomic_store_release(&has_writer, 0);
-
-/* Wake up all coroutine that are waiting to read the graph */
-qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
+aio_bh_poll(qemu_get_aio_context());
 }
 
 void coroutine_fn bdrv_graph_co_rdlock(void)
diff --git a/tests/qemu-iotests/051.pc.out b/tests/qemu-iotests/051.pc.out
index 4d4af5a486..7e10c5fa1b 100644
--- a/tests/qemu-iotests/051.pc.out
+++ b/tests/qemu-iotests/051.pc.out
@@ -169,11 +169,11 @@ QEMU_PROG: -device scsi-hd,drive=disk: Device needs media, but drive is empty
 
 Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device ide-hd,drive=disk,share-rw=on
 QEMU X.Y.Z monitor - type 'help' for more information
-QEMU_PROG: -device ide-hd,drive=disk,share-rw=on: Cannot change iothread of active block backend
+(qemu) QEMU_PROG: -device ide-hd,drive=disk,share-rw=on: Cannot change iothread of active block backend