For this version: - moved the await into postcopy_pause() as Peter suggested;
- brought back the mark_source_rp_bad call. Turns out that piece of code is filled with nuance. I just moved it aside since it doesn't make sense during pause/resume. We can tackle that when we get the chance. CI run: https://gitlab.com/farosas/qemu/-/pipelines/953420150 Also ran the switchover and preempt tests for 1000 times each on x86_64. v1: https://lore.kernel.org/r/20230728121516.16258-1-faro...@suse.de The /x86_64/migration/postcopy/preempt/recovery/plain test is sometimes failing due a segmentation fault on the migration return path. There is a race involving the retry logic of the return path and the migration resume command. The issue happens when the retry logic tries to cleanup the current return path file, but ends up cleaning the new one and trying to use it right after. Tracing shows it clearly: open_return_path_on_source <-- at migration start open_return_path_on_source_continue <-- rp thread created postcopy_pause_incoming postcopy_pause_fast_load qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. (incoming) postcopy_pause_fault_thread qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. (source) postcopy_pause_incoming_continued open_return_path_on_source <-- NOK, too soon postcopy_pause_continued postcopy_pause_return_path <-- too late, already operating on the new from_dst_file postcopy_pause_return_path_continued <-- will continue and crash postcopy_pause_incoming qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. postcopy_pause_incoming_continued We could solve this by adding some form of synchronization to ensure that we always do the cleanup before setting up the new file, but I find it more straight-forward to move the retry logic outside of the thread by letting it finish and starting a new thread when resuming the migration. More details on the commit message. CI run: https://gitlab.com/farosas/qemu/-/pipelines/947875609 Fabiano Rosas (2): migration: Split await_return_path_close_on_source migration: Replace the return path retry logic migration/migration.c | 110 ++++++++++++++++------------------------- migration/migration.h | 1 - migration/trace-events | 1 + 3 files changed, 44 insertions(+), 68 deletions(-) -- 2.35.3