On Wed, May 18, 2022 at 02:28:41PM +0200, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 17/05/2022 um 12:59 schrieb Stefan Hajnoczi:
> > On Wed, May 04, 2022 at 02:39:05PM +0100, Stefan Hajnoczi wrote:
> >> On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
> >>> This is a new attempt to replace the need to take the AioContext lock to
> >>> protect graph modifications. In particular, we aim to remove
> >>> (or better, substitute) the AioContext around bdrv_replace_child_noperm,
> >>> since this function changes BlockDriverState's ->parents and ->children
> >>> lists.
> >>>
> >>> In the previous version, we decided to discard using subtree_drains to
> >>> protect the nodes, for different reasons: for those unfamiliar with it,
> >>> please see 
> >>> https://patchew.org/QEMU/20220301142113.163174-1-eespo...@redhat.com/
> >>
> >> I reread the thread and it's unclear to me why drain is the wrong
> >> mechanism for protecting graph modifications. We theorized a lot but
> >> ultimately is this new mechanism sufficiently different from
> >> bdrv_drained_begin()/end() to make it worth implementing?
> >>
> >> Instead of invoking .drained_begin() callbacks to stop further I/O,
> >> we're now queuing coroutines (without backpressure information that
> >> whoever is spawning I/O needs so they can stop). The writer still waits
> >> for in-flight I/O to finish, including I/O not associated with the bdrv
> >> graph we wish to modify (because rdlock is per-AioContext and unrelated
> >> to a specific graph). Is this really more lightweight than drain?
> >>
> >> If I understand correctly, the original goal was to avoid the need to
> >> hold the AioContext lock across bdrv_replace_child_noperm(). I would
> >> focus on that and use drain for now.
> >>
> >> Maybe I've missed an important point about why the new mechanism is
> >> needed?
> > 
> > Ping?
> 
> label: // read till the end to see why I wrote this here
> 
> I was hoping someone from the "No" party would answer to your question,
> because as you know we reached this same conclusion together.
> 
> We thought about dropping the drain for various reasons, the main one
> (at least as far as I understood) is that we are not sure if something
> can still happen in between drain_begin/end, and it is a little bit
> confusing to use the same mechanism to block I/O and protect the graph.

We had discussions about what could go wrong and there was a feeling
that maybe a separate mechanism is appropriate for graph modifications,
but there was no concrete reason why draining around graph modification
won't work.

If no one has a concrete reason then drain still seems like the most
promising approach to protecting graph modifications. The rwlock patch
wasn't sufficiently different from drain to have significant advantages
in my opinion.

> We then thought about implementing a rwlock. A rdlock would clarify what
> we are protecting and who is using the lock. I had a rwlock draft
> implementation sent in this thread, but this also lead to additional
> problems.
> Main problem was that this new lock would introduce nested event loops,
> that together with such locking would just create deadlocks.
> If readers are in coroutines and writers are not (because graph
> operations are not running in coroutines), we have a lot of deadlocks.
> If a writer has to take the lock, it must wait all other readers to
> finish. And it does it by internally calling AIO_WAIT_WHILE, creating
> nested event loop. We don't know what could execute when polling for
> events, and for example another writer could be resumed.
> Ideally, we want writers in coroutines too.

What is the deadlock? Do the readers depend on the writer somehow?

> Additionally, many readers are running in what we call "mixed"
> functions: usually implemented automatically with generated_co_wrapper
> tag, they let a function (usually bdrv callback) run always in a
> coroutine, creating one if necessary. For example, bdrv_flush() makes
> sure hat bs->bdrv_co_flush() is always run in a coroutine.
> Such mixed functions are used in other callbacks too, making it really
> difficult to understand if we are in a coroutine or not, and mostly
> important make rwlock usage very difficult.
> 
> Which lead us to stepping back once more and try to convert all
> BlockDriverState callbacks in coroutines. This would greatly simplify
> rwlock usage, because we could make the rwlock coroutine-frendly
> (without any AIO_WAIT_WHILE, allowing a writer to wait for readers by
> just yielding and queuing itself in coroutine queues).
> 
> First step was then to convert all callbacks in coroutines, using
> generated_coroutine_wrapper (g_c_w).
> A typical g_c_w is implemented in this way:
>       if (qemu_in_coroutine()) {
>               callback();
>       } else { // much simplified
>               co = qemu_coroutine_create(callback);
>               bdrv_coroutine_enter(bs, co);
>               BDRV_POLL_WHILE(bs, coroutine_in_progress);
>       }
> Once all callbacks are implemented using g_c_w, we can start splitting
> the two sides of the if function to only create a coroutine when we are
> outside from a bdrv callback.
> 
> However, we immediately found a problem while starting to convert the
> first callbacks: the AioContext lock is taken around some non coroutine
> callbacks! For example, bs->bdrv_open() is always called with the
> AioContext lock taken. In addition, callbacks like bdrv_open are
> graph-modifying functions, which is probably why we are taking the
> Aiocontext lock, and they do not like to run in coroutines.
> Anyways, the real problem comes when we create a coroutine in such
> places where the AioContext lock is taken and we have a graph-modifying
> function.
> 
> bdrv_coroutine_enter() calls aio_co_enter(), which in turns first checks
>  if the coroutine is entering another context from the current (which is
> not the case for open) and if we are already in coroutine (for sure
> not). Therefore it resorts to the following calls;
>       aio_context_acquire(ctx);
>         qemu_aio_coroutine_enter(ctx, co);
>         aio_context_release(ctx);
> Which is clearly a problem, because we are taking the lock twice: once
> from the original caller of the callback, and once here due to the
> coroutine. This creates a lot of deadlock situations.

aio_context_acquire() is a recursive lock. Where is the deadlock?

Stefan

Attachment: signature.asc
Description: PGP signature

Reply via email to