Re: [PATCH 1/3] migration: Stop marking RP bad after shutdown
Fabiano Rosas writes: > Peter Xu writes: > >> On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote: >>> When waiting for the return path (RP) thread to finish, there is >>> really nothing wrong in the RP if the destination end of the migration >>> stops responding, leaving it stuck. >>> >>> Stop returning an error at that point and leave it to other parts of >>> the code to catch. One such part is the very next routine run by >>> migration_completion() which checks 'to_dst_file' for an error and fails >>> the migration. Another is the RP thread itself when the recvmsg() >>> returns an error. >>> >>> With this we stop marking RP bad from outside of the thread and can >>> reuse await_return_path_close_on_source() in the next patches to wait >>> on the thread during a paused migration. >>> >>> Signed-off-by: Fabiano Rosas >>> --- >>> migration/migration.c | 1 - >>> 1 file changed, 1 deletion(-) >>> >>> diff --git a/migration/migration.c b/migration/migration.c >>> index 91bba630a8..051067f8c5 100644 >>> --- a/migration/migration.c >>> +++ b/migration/migration.c >>> @@ -2049,7 +2049,6 @@ static int >>> await_return_path_close_on_source(MigrationState *ms) >>> * waiting for the destination. >>> */ >>> qemu_file_shutdown(ms->rp_state.from_dst_file); >>> -mark_source_rp_bad(ms); >>> } >>> trace_await_return_path_close_on_source_joining(); >>> qemu_thread_join(&ms->rp_state.rp_thread); >> >> The retval of await_return_path_close_on_source() relies on >> ms->rp_state.error. If mark_source_rp_bad() is dropped, is it possible >> that it'll start to return succeed where it used to return failure? > > Yep, as described in the commit message, I think it's ok to do that. The > critical part is doing the shutdown. Other instances of > mark_source_rp_bad() continue existing and we continue returning > rp_state.error. > >> >> Maybe not a big deal: I see migration_completion() also has another >> qemu_file_get_error() later to catch errors, but I don't know how solid >> that is. > > That is the instance I refer to in the commit message. At > await_return_path_close_on_source() we only call mark_source_rp_bad() if > to_dst_file has an error. That will be caught by this > qemu_file_get_error() anyway. Actually, I can do better, I can merge this shutdown() into migration_completion(). Then this dependency becomes explicit. Since you suggested moving await_return_path_close_on_source() into postcopy_pause(), it doesn't make sense to check to_dst_file anymore, because when pausing we clear that file.
Re: [PATCH 1/3] migration: Stop marking RP bad after shutdown
Peter Xu writes: > On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote: >> When waiting for the return path (RP) thread to finish, there is >> really nothing wrong in the RP if the destination end of the migration >> stops responding, leaving it stuck. >> >> Stop returning an error at that point and leave it to other parts of >> the code to catch. One such part is the very next routine run by >> migration_completion() which checks 'to_dst_file' for an error and fails >> the migration. Another is the RP thread itself when the recvmsg() >> returns an error. >> >> With this we stop marking RP bad from outside of the thread and can >> reuse await_return_path_close_on_source() in the next patches to wait >> on the thread during a paused migration. >> >> Signed-off-by: Fabiano Rosas >> --- >> migration/migration.c | 1 - >> 1 file changed, 1 deletion(-) >> >> diff --git a/migration/migration.c b/migration/migration.c >> index 91bba630a8..051067f8c5 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -2049,7 +2049,6 @@ static int >> await_return_path_close_on_source(MigrationState *ms) >> * waiting for the destination. >> */ >> qemu_file_shutdown(ms->rp_state.from_dst_file); >> -mark_source_rp_bad(ms); >> } >> trace_await_return_path_close_on_source_joining(); >> qemu_thread_join(&ms->rp_state.rp_thread); > > The retval of await_return_path_close_on_source() relies on > ms->rp_state.error. If mark_source_rp_bad() is dropped, is it possible > that it'll start to return succeed where it used to return failure? Yep, as described in the commit message, I think it's ok to do that. The critical part is doing the shutdown. Other instances of mark_source_rp_bad() continue existing and we continue returning rp_state.error. > > Maybe not a big deal: I see migration_completion() also has another > qemu_file_get_error() later to catch errors, but I don't know how solid > that is. That is the instance I refer to in the commit message. At await_return_path_close_on_source() we only call mark_source_rp_bad() if to_dst_file has an error. That will be caught by this qemu_file_get_error() anyway.
Re: [PATCH 1/3] migration: Stop marking RP bad after shutdown
On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote: > When waiting for the return path (RP) thread to finish, there is > really nothing wrong in the RP if the destination end of the migration > stops responding, leaving it stuck. > > Stop returning an error at that point and leave it to other parts of > the code to catch. One such part is the very next routine run by > migration_completion() which checks 'to_dst_file' for an error and fails > the migration. Another is the RP thread itself when the recvmsg() > returns an error. > > With this we stop marking RP bad from outside of the thread and can > reuse await_return_path_close_on_source() in the next patches to wait > on the thread during a paused migration. > > Signed-off-by: Fabiano Rosas > --- > migration/migration.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/migration/migration.c b/migration/migration.c > index 91bba630a8..051067f8c5 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -2049,7 +2049,6 @@ static int > await_return_path_close_on_source(MigrationState *ms) > * waiting for the destination. > */ > qemu_file_shutdown(ms->rp_state.from_dst_file); > -mark_source_rp_bad(ms); > } > trace_await_return_path_close_on_source_joining(); > qemu_thread_join(&ms->rp_state.rp_thread); The retval of await_return_path_close_on_source() relies on ms->rp_state.error. If mark_source_rp_bad() is dropped, is it possible that it'll start to return succeed where it used to return failure? Maybe not a big deal: I see migration_completion() also has another qemu_file_get_error() later to catch errors, but I don't know how solid that is. I think as long as after this patch we can fail properly on e.g. network failures for precopy when cap return-path=on, then we should be good. -- Peter Xu
[PATCH 1/3] migration: Stop marking RP bad after shutdown
When waiting for the return path (RP) thread to finish, there is really nothing wrong in the RP if the destination end of the migration stops responding, leaving it stuck. Stop returning an error at that point and leave it to other parts of the code to catch. One such part is the very next routine run by migration_completion() which checks 'to_dst_file' for an error and fails the migration. Another is the RP thread itself when the recvmsg() returns an error. With this we stop marking RP bad from outside of the thread and can reuse await_return_path_close_on_source() in the next patches to wait on the thread during a paused migration. Signed-off-by: Fabiano Rosas --- migration/migration.c | 1 - 1 file changed, 1 deletion(-) diff --git a/migration/migration.c b/migration/migration.c index 91bba630a8..051067f8c5 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -2049,7 +2049,6 @@ static int await_return_path_close_on_source(MigrationState *ms) * waiting for the destination. */ qemu_file_shutdown(ms->rp_state.from_dst_file); -mark_source_rp_bad(ms); } trace_await_return_path_close_on_source_joining(); qemu_thread_join(&ms->rp_state.rp_thread); -- 2.35.3