I decided to fix the issues with the shutdown instead of complaining about them. First 5 patches address all of the possible races I found. The only problem left to figure out is the -EIO on shutdown which will need more thought.
Patches 6 & 7 fix the original segfault. Patches 8-10 make the cleanup of migration files more predictable and centralized. I also adjusted some small things that were mentioned by Peter: - moved the rp.error = false outside of the thread; - stopped checking for errors during postcopy_pause(); - dropped the tracepoint; CI run: https://gitlab.com/farosas/qemu/-/pipelines/963407228 v2: https://lore.kernel.org/r/20230802143644.7534-1-faro...@suse.de - moved the await into postcopy_pause() as Peter suggested; - brought back the mark_source_rp_bad call. Turns out that piece of code is filled with nuance. I just moved it aside since it doesn't make sense during pause/resume. We can tackle that when we get the chance. CI run: https://gitlab.com/farosas/qemu/-/pipelines/953420150 Also ran the switchover and preempt tests for 1000 times each on x86_64. v1: https://lore.kernel.org/r/20230728121516.16258-1-faro...@suse.de The /x86_64/migration/postcopy/preempt/recovery/plain test is sometimes failing due a segmentation fault on the migration return path. There is a race involving the retry logic of the return path and the migration resume command. The issue happens when the retry logic tries to cleanup the current return path file, but ends up cleaning the new one and trying to use it right after. Tracing shows it clearly: open_return_path_on_source <-- at migration start open_return_path_on_source_continue <-- rp thread created postcopy_pause_incoming postcopy_pause_fast_load qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. (incoming) postcopy_pause_fault_thread qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. (source) postcopy_pause_incoming_continued open_return_path_on_source <-- NOK, too soon postcopy_pause_continued postcopy_pause_return_path <-- too late, already operating on the new from_dst_file postcopy_pause_return_path_continued <-- will continue and crash postcopy_pause_incoming qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. postcopy_pause_incoming_continued We could solve this by adding some form of synchronization to ensure that we always do the cleanup before setting up the new file, but I find it more straight-forward to move the retry logic outside of the thread by letting it finish and starting a new thread when resuming the migration. More details on the commit message. CI run: https://gitlab.com/farosas/qemu/-/pipelines/947875609 Fabiano Rosas (10): migration: Fix possible race when setting rp_state.error migration: Fix possible race when shutting return path migration: Fix possible race when checking to_dst_file for errors migration: Fix possible race when shutting down to_dst_file migration: Remove redundant cleanup of postcopy_qemufile_src migration: Consolidate return path closing code migration: Replace the return path retry logic migration: Move return path cleanup to main migration thread migration: Be consistent about shutdown of source shared files migration: Add a wrapper to cleanup migration files migration/migration.c | 228 +++++++++++++++--------------------------- migration/migration.h | 1 - util/yank.c | 2 - 3 files changed, 79 insertions(+), 152 deletions(-) -- 2.35.3