On Wed, May 18, 2022 at 02:28:41PM +0200, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 17/05/2022 at 12:59, Stefan Hajnoczi wrote:
> > On Wed, May 04, 2022 at 02:39:05PM +0100, Stefan Hajnoczi wrote:
> >> On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
> >>> This is a new attempt to replace the need to take the AioContext lock to
> >>> protect graph modifications. In particular, we aim to remove
> >>> (or better, substitute) the AioContext lock around bdrv_replace_child_noperm,
> >>> since this function changes BlockDriverState's ->parents and ->children
> >>> lists.
> >>>
> >>> In the previous version, we decided to discard using subtree_drains to
> >>> protect the nodes, for different reasons: for those unfamiliar with it,
> >>> please see
> >>> https://patchew.org/QEMU/20220301142113.163174-1-eespo...@redhat.com/
> >>
> >> I reread the thread and it's unclear to me why drain is the wrong
> >> mechanism for protecting graph modifications. We theorized a lot, but
> >> ultimately is this new mechanism sufficiently different from
> >> bdrv_drained_begin()/end() to make it worth implementing?
> >>
> >> Instead of invoking .drained_begin() callbacks to stop further I/O,
> >> we're now queuing coroutines (without the backpressure information that
> >> whoever is spawning I/O needs so they can stop). The writer still waits
> >> for in-flight I/O to finish, including I/O not associated with the bdrv
> >> graph we wish to modify (because the rdlock is per-AioContext and
> >> unrelated to a specific graph). Is this really more lightweight than
> >> drain?
> >>
> >> If I understand correctly, the original goal was to avoid the need to
> >> hold the AioContext lock across bdrv_replace_child_noperm(). I would
> >> focus on that and use drain for now.
> >>
> >> Maybe I've missed an important point about why the new mechanism is
> >> needed?
> >
> > Ping?
> 
> label: // read till the end to see why I wrote this here
> 
> I was hoping someone from the "No" party would answer your question,
> because, as you know, we reached this same conclusion together.
> 
> We thought about dropping the drain for various reasons, the main one
> (at least as far as I understood) being that we are not sure whether
> something can still happen in between drain_begin/end, and it is a
> little bit confusing to use the same mechanism to block I/O and to
> protect the graph.
We had discussions about what could go wrong and there was a feeling that
maybe a separate mechanism is appropriate for graph modifications, but
there was no concrete reason why draining around graph modification won't
work. If no one has a concrete reason, then drain still seems like the
most promising approach to protecting graph modifications. The rwlock
patch wasn't sufficiently different from drain to have significant
advantages, in my opinion.

> We then thought about implementing a rwlock. A rdlock would clarify what
> we are protecting and who is using the lock. I had a rwlock draft
> implementation sent in this thread, but this also led to additional
> problems.
> The main problem was that this new lock would introduce nested event
> loops, which together with such locking would just create deadlocks.
> If readers are in coroutines and writers are not (because graph
> operations are not running in coroutines), we have a lot of deadlocks.
> If a writer has to take the lock, it must wait for all other readers to
> finish. And it does so by internally calling AIO_WAIT_WHILE, creating a
> nested event loop. We don't know what could execute when polling for
> events, and for example another writer could be resumed.
> Ideally, we want writers in coroutines too.

What is the deadlock? Do the readers depend on the writer somehow?

> Additionally, many readers run in what we call "mixed" functions:
> usually implemented automatically with the generated_co_wrapper tag,
> they let a function (usually a bdrv callback) always run in a coroutine,
> creating one if necessary. For example, bdrv_flush() makes sure that
> bs->bdrv_co_flush() always runs in a coroutine.
> Such mixed functions are used in other callbacks too, making it really
> difficult to understand whether we are in a coroutine or not and, most
> importantly, making rwlock usage very difficult.
>
> This led us to step back once more and try to convert all
> BlockDriverState callbacks into coroutines.
> This would greatly simplify rwlock usage, because we could make the
> rwlock coroutine-friendly (without any AIO_WAIT_WHILE, allowing a writer
> to wait for readers by just yielding and queuing itself in coroutine
> queues).
>
> The first step was then to convert all callbacks into coroutines, using
> generated_coroutine_wrapper (g_c_w).
> A typical g_c_w is implemented in this way:
>
>     if (qemu_in_coroutine()) {
>         callback();
>     } else { // much simplified
>         co = qemu_coroutine_create(callback);
>         bdrv_coroutine_enter(bs, co);
>         BDRV_POLL_WHILE(bs, coroutine_in_progress);
>     }
>
> Once all callbacks are implemented using g_c_w, we can start splitting
> the two sides of the if statement so that we only create a coroutine
> when we are outside a bdrv callback.
>
> However, we immediately found a problem while starting to convert the
> first callbacks: the AioContext lock is taken around some non-coroutine
> callbacks! For example, bs->bdrv_open() is always called with the
> AioContext lock taken. In addition, callbacks like bdrv_open are
> graph-modifying functions, which is probably why we are taking the
> AioContext lock, and they do not like to run in coroutines.
> Anyway, the real problem comes when we create a coroutine in such
> places, where the AioContext lock is taken and we have a graph-modifying
> function.
>
> bdrv_coroutine_enter() calls aio_co_enter(), which in turn first checks
> whether the coroutine is entering a context different from the current
> one (which is not the case for open) and whether we are already in a
> coroutine (for sure not). Therefore it resorts to the following calls:
>
>     aio_context_acquire(ctx);
>     qemu_aio_coroutine_enter(ctx, co);
>     aio_context_release(ctx);
>
> This is clearly a problem, because we are taking the lock twice: once
> from the original caller of the callback, and once here due to the
> coroutine. This creates a lot of deadlock situations.

aio_context_acquire() is a recursive lock. Where is the deadlock?

Stefan