On Wed, Aug 02, 2023 at 11:36:43AM -0300, Fabiano Rosas wrote: > This function currently has a straight-forward part which is waiting > for the thread to join and a complicated part which is doing a > qemu_file_shutdown() on the return path file. > > The shutdown is tricky because all calls to qemu_file_shutdown() set > f->last_error to -EIO, which means we can never know if an error is an > actual error or if we cleanly shutdown the file previously. > > This is particularly bothersome for postcopy because it would send the > return path thread into the retry routine which would wait on the > postcopy_pause_rp_sem and consequently block the main thread. We > haven't had reports of this so I must presume we never reach here with > postcopy. > > The shutdown call is also racy because since it doesn't take the > qemu_file_lock, it could NULL-dereference if the return path thread > happens to be in the middle of the critical region at > migration_release_dst_files().
After you rework the thread model on resume, shall we move migration_release_dst_files() into the migration thread to be after the pthread_join()? I assume then we don't even need a mutex to protect it? > > Move this more complicated part of the code to a separate routine so > we can wait on the thread without all of this baggage. I think you mentioned "some nuance" on having mark_source_rp_bad() in await_return_path_close_on_source(), I did remember I tried to look into that "nuance" too a long time ago but I just forgot what was that. Great if you can share some details. > > Signed-off-by: Fabiano Rosas <faro...@suse.de> > --- > migration/migration.c | 46 +++++++++++++++++++++++++------------------ > 1 file changed, 27 insertions(+), 19 deletions(-) > > diff --git a/migration/migration.c b/migration/migration.c > index 91bba630a8..58f09275a8 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -2038,6 +2038,25 @@ static int open_return_path_on_source(MigrationState > *ms, > /* Returns 0 if the RP was ok, otherwise there was an error on the RP */ > static int await_return_path_close_on_source(MigrationState *ms) > { > + if (!ms->rp_state.rp_thread_created) { > + return 0; > + } > + > + trace_await_return_path_close_on_source_joining(); > + qemu_thread_join(&ms->rp_state.rp_thread); > + ms->rp_state.rp_thread_created = false; > + trace_await_return_path_close_on_source_close(); > + return ms->rp_state.error; > +} > + > +static int close_return_path_on_source(MigrationState *ms) > +{ > + int ret; > + > + if (!ms->rp_state.rp_thread_created) { > + return 0; > + } Can we still rely on the await_return_path_close_on_source() check, so as to dedup this one? > + > /* > * If this is a normal exit then the destination will send a SHUT and the > * rp_thread will exit, however if there's an error we need to cause > @@ -2051,11 +2070,12 @@ static int > await_return_path_close_on_source(MigrationState *ms) > qemu_file_shutdown(ms->rp_state.from_dst_file); > mark_source_rp_bad(ms); > } > - trace_await_return_path_close_on_source_joining(); > - qemu_thread_join(&ms->rp_state.rp_thread); > - ms->rp_state.rp_thread_created = false; > - trace_await_return_path_close_on_source_close(); > - return ms->rp_state.error; > + > + trace_migration_return_path_end_before(); > + ret = await_return_path_close_on_source(ms); > + trace_migration_return_path_end_after(ret); > + > + return ret; > } > > static inline void > @@ -2351,20 +2371,8 @@ static void migration_completion(MigrationState *s) > goto fail; > } > > - /* > - * If rp was opened we must clean up the thread before > - * cleaning everything else up (since if there are no failures > - * it will wait for the destination to send it's status in > - * a SHUT command). > - */ > - if (s->rp_state.rp_thread_created) { > - int rp_error; > - trace_migration_return_path_end_before(); > - rp_error = await_return_path_close_on_source(s); > - trace_migration_return_path_end_after(rp_error); > - if (rp_error) { > - goto fail; > - } > + if (close_return_path_on_source(s)) { > + goto fail; > } > > if (qemu_file_get_error(s->to_dst_file)) { > -- > 2.35.3 > -- Peter Xu