Peter Xu <pet...@redhat.com> writes: > On Tue, Feb 06, 2024 at 06:51:12PM -0300, Fabiano Rosas wrote: >> Based-on: 20240202102857.110210-1-pet...@redhat.com >> [PATCH v2 00/23] migration/multifd: Refactor ->send_prepare() and cleanups >> https://lore.kernel.org/r/20240202102857.110210-1-pet...@redhat.com >> >> Hi, >> >> For v3 I fixed the refcounting issue spotted by Avihai. The situation >> there is a bit clunky due to historical reasons. The gist is that we >> have an assumption that channel creation never fails after p->c has >> been set, so when 'p->c == NULL' we have to unref and when 'p->c != >> NULL' the cleanup code will do the unref. > > Yes, this looks good to me. That's a good catch. > > I'll leave at least one more day for Avihai and/or Dan to have another > look. My r-b persist as of now on patch 5. > > Actually I think the conditional unref is slightly tricky, but it's not its > own fault, IMHO, OTOH it's more about a1af605bd5ad where p->c is slightly > abused. My understanding is we can avoid that conditional unref with below > patch 1 as a cleanup (on top of this series). Then patch 2 comes all > alongside.
Yes, I even wrote some code to always set p->c and leave the unref to the cleanup routine. Doing reference counting in the middle of the code like that leaves us exposed to new code relying on p->c being valid during cleanup. There's already yank and the cleanup itself which expect p->c to be valid. However, I couldn't get past the uglyness of overwriting p->c, so I kept the changes minimal for this version. I'm also wondering whether we can remove the TLS handshake thread altogether now that we moved the multifd_send_setup call into the migration thread. My (poor) understanding is that a1af605bd5ad hit the issue that the QIOTask completion would just execute after multifd_save_setup returned. Otherwise I don't see how adding a thread to an already async task would have helped. But I need to think about that a bit more. > > We don't need to rush on these, though, we should fix the thread race >first because multiple of us hit it, and all cleanups can be done >later. I said we should not merge these two changes right now, but I take that back. I'll leave it up to you, there doesn't seem to be that much impact in adding them. > > ===== > From 0830819d86e08c5175d6669505aa712a0a09717f Mon Sep 17 00:00:00 2001 > From: Peter Xu <pet...@redhat.com> > Date: Wed, 7 Feb 2024 10:08:35 +0800 > Subject: [PATCH 1/2] migration/multifd: Cleanup TLS iochannel referencing > > Commit a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to > blocking handshake") introduced a thread for TLS channels, which will > resolve the issue on blocking the main thread. However in the same commit > p->c is slightly abused just to be able to pass over the pointer "p" into > the thread. > > That's the major reason we'll need to conditionally free the io channel in > the fault paths. > > To clean it up, using a separate structure to pass over both "p" and "tioc" > in the tls handshake thread. Then we can make it a rule that p->c will > never be set until the channel is completely setup. With that, we can drop > the tricky conditional unref of the io channel in the error path. > > Signed-off-by: Peter Xu <pet...@redhat.com> > --- > migration/multifd.c | 37 +++++++++++++++++++++++-------------- > 1 file changed, 23 insertions(+), 14 deletions(-) > > diff --git a/migration/multifd.c b/migration/multifd.c > index adfe8c9a0a..4a85a6b7b3 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -873,16 +873,22 @@ out: > > static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque); > > +typedef struct { > + MultiFDSendParams *p; > + QIOChannelTLS *tioc; > +} MultiFDTLSThreadArgs; > + > static void *multifd_tls_handshake_thread(void *opaque) > { > - MultiFDSendParams *p = opaque; > - QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c); > + MultiFDTLSThreadArgs *args = opaque; > > - qio_channel_tls_handshake(tioc, > + qio_channel_tls_handshake(args->tioc, > multifd_new_send_channel_async, > - p, > + args->p, > NULL, > NULL); > + g_free(args); > + > return NULL; > } > > @@ -892,6 +898,7 @@ static bool multifd_tls_channel_connect(MultiFDSendParams > *p, > { > MigrationState *s = migrate_get_current(); > const char *hostname = s->hostname; > + MultiFDTLSThreadArgs *args; > QIOChannelTLS *tioc; > > tioc = migration_tls_client_create(ioc, hostname, errp); > @@ -906,11 +913,14 @@ static bool > multifd_tls_channel_connect(MultiFDSendParams *p, > object_unref(OBJECT(ioc)); > trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname); > qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing"); > - p->c = QIO_CHANNEL(tioc); This assignment also meant multifd_send_channel_destroy() would call object_unref on the tioc object. Removing it means qio_channel_tls_finalize() will never be called. It also means the socket channel (ioc) refcount will be decremented one too many times, due to the object_unref above^. So I think we should find a point where tioc is not needed anymore and unref it and remove the object_unref(ioc) above. Right? > + > + args = g_new0(MultiFDTLSThreadArgs, 1); > + args->tioc = tioc; > + args->p = p; > > p->tls_thread_created = true; > qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker", > - multifd_tls_handshake_thread, p, > + multifd_tls_handshake_thread, args, > QEMU_THREAD_JOINABLE); > return true; > } > @@ -923,6 +933,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p, > > migration_ioc_register_yank(ioc); > p->registered_yank = true; > + /* Setup p->c only if the channel is completely setup */ > p->c = ioc; > > p->thread_created = true; > @@ -976,14 +987,12 @@ out: > > trace_multifd_new_send_channel_async_error(p->id, local_err); > multifd_send_set_error(local_err); > - if (!p->c) { > - /* > - * If no channel has been created, drop the initial > - * reference. Otherwise cleanup happens at > - * multifd_send_channel_destroy() > - */ > - object_unref(OBJECT(ioc)); > - } > + /* > + * For error cases (TLS or non-TLS), IO channel is always freed here > + * rather than when cleanup multifd: since p->c is not set, multifd > + * cleanup code doesn't even know its existance. > + */ > + object_unref(OBJECT(ioc)); > error_free(local_err); > } > > -- > 2.43.0 > ===== > From 9e574c3216f6459e3a808096d905e2554d962cad Mon Sep 17 00:00:00 2001 > From: Peter Xu <pet...@redhat.com> > Date: Wed, 7 Feb 2024 10:24:39 +0800 > Subject: [PATCH 2/2] migration/multifd: Drop registered_yank > > With a clear definition of p->c protocol, where we only set it up if the > channel is fully established (TLS or non-TLS), registered_yank boolean will > have equal meaning of "p->c != NULL". > > Drop registered_yank by checking p->c instead. > > Signed-off-by: Peter Xu <pet...@redhat.com> This one looks good. I know it depends on the previous patch, but if you plan to add it: Reviewed-by: Fabiano Rosas <faro...@suse.de> > --- > migration/multifd.h | 2 -- > migration/multifd.c | 7 +++---- > 2 files changed, 3 insertions(+), 6 deletions(-) > > diff --git a/migration/multifd.h b/migration/multifd.h > index 8a1cad0996..b3fe27ae93 100644 > --- a/migration/multifd.h > +++ b/migration/multifd.h > @@ -78,8 +78,6 @@ typedef struct { > bool tls_thread_created; > /* communication channel */ > QIOChannel *c; > - /* is the yank function registered */ > - bool registered_yank; > /* packet allocated len */ > uint32_t packet_len; > /* guest page size */ > diff --git a/migration/multifd.c b/migration/multifd.c > index 4a85a6b7b3..278453cf84 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -648,11 +648,11 @@ static int multifd_send_channel_destroy(QIOChannel > *send) > > static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp) > { > - if (p->registered_yank) { > + if (p->c) { > migration_ioc_unregister_yank(p->c); > + multifd_send_channel_destroy(p->c); > + p->c = NULL; > } > - multifd_send_channel_destroy(p->c); > - p->c = NULL; > qemu_sem_destroy(&p->sem); > qemu_sem_destroy(&p->sem_sync); > g_free(p->name); > @@ -932,7 +932,6 @@ static bool multifd_channel_connect(MultiFDSendParams *p, > qio_channel_set_delay(ioc, false); > > migration_ioc_register_yank(ioc); > - p->registered_yank = true; > /* Setup p->c only if the channel is completely setup */ > p->c = ioc; > > -- > 2.43.0 > ==== > > Thanks,