Peter Xu <[email protected]> writes: > On Fri, Jan 23, 2026 at 02:36:28PM -0300, Fabiano Rosas wrote: >> Peter Xu <[email protected]> writes: >> >> > On Fri, Jan 23, 2026 at 09:59:35AM -0300, Fabiano Rosas wrote: >> >> Peter Xu <[email protected]> writes: >> >> >> >> > Devices may opt-in migration FAILED notifiers to be invoked when >> >> > migration >> >> > fails. Currently, the notifications happen in migration_cleanup(). It >> >> > is >> >> > normally fine, but maybe not ideal if there's dependency of the fallback >> >> > v.s. VM starts. >> >> > >> >> > This patch moves the FAILED notification earlier, so that if the failure >> >> > happened during switchover, it'll notify before VM restart. >> >> > >> >> >> >> The change to FAILED in patch 2 should come to this patch to avoid >> >> having a window where the notification only happens at the end. >> > >> > Hmm.. Isn't that expected? Even after patch 2, we still notify FAILED at >> > the end for precopy. It's the same for postcopy. >> > >> >> Sorry, I meant: s/at the end/after vm_start/. >> >> > For a failed postcopy we have following behavior: >> > >> > Before patch 2 >> > ============== >> > >> > - notify FAILED (during switchover) >> > - vm_start() >> > - notify FAILED (during migration_cleanup) >> > >> > After patch 2 >> > ============= >> > >> > - vm_start() >> > - notify FAILED (during migration_cleanup) >> > >> > So patch 2 fixes the duplicate issue, and only fixes that. >> > >> > After patch 3 >> > ============= >> > >> > - notify FAILED (during migration_iteration_finish) >> > - vm_start() >> > >> > Patch 3 changes the place of FAILED notification so that it happens always >> > before vm_start(), for both precopy and postcopy. >> >> Right, my point is that with patch 3 we're establishing that the correct >> place to notify is before vm_start(). > > Yep, likely not strictly correctness in terms of current notifiers, but > since Stefan may have yet another use case that may require a notifier to > be done before vm_start(), it makes more sense for us to move, IMHO. > >> But after patch 2, *if* any driver actually depends on being informed of >> failure *before* starting the VM, that will not happen. I think both >> changes could be made at once so that this intermediate state never >> exists. > > I see what you meant. I think there should have no such user. > > It's because we always notify FAILED at migration_cleanup() for precopy, or > even postcopy before the cpr-exec work (before QEMU 9.0). > > That behavior of "notify FAILED before vm_start() for postcopy" is very > specific and only added after commit 4af667f87c ("migration: notifier error > checking"). IOW, before QEMU 9.0, for both precopy and postcopy we always > notify FAILED in migration_cleanup(), never before vm_start(). > > I mentioned this in the commit log of previous patch too, where I bet the > additional FAILED notification added in 4af667f87c for postcopy path is an > accident (to make it pairing with the "reused DONE", however it turns out > we likely shouldn't do either of them..). So I don't expect anything will > depend on that behavior, and only for postcopy. > > The benefit of splitting this patch and previous one is, the previous one > is a "fix" of duplicated notifications, hence if we need a backport that > can be done without this one. Said that, I don't think one should need > it.. It should also make each commits slightly easier to follow, because > they're fundamentally two changes. > > Let me know what you think after reading my explanations above. I prefer > the split like as-is, but I can still squash it to close the trivially > small window that you described. I'll make sure if merged the commit > message contains separate discussions on two problems.
Ok, fair points. Reviewed-by: Fabiano Rosas <[email protected]>
