Peter Xu <[email protected]> writes:

> On Fri, Jan 23, 2026 at 02:36:28PM -0300, Fabiano Rosas wrote:
>> Peter Xu <[email protected]> writes:
>> 
>> > On Fri, Jan 23, 2026 at 09:59:35AM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <[email protected]> writes:
>> >> 
>> >> > Devices may opt-in migration FAILED notifiers to be invoked when 
>> >> > migration
>> >> > fails.  Currently, the notifications happen in migration_cleanup().  It 
>> >> > is
>> >> > normally fine, but maybe not ideal if there's dependency of the fallback
>> >> > v.s. VM starts.
>> >> >
>> >> > This patch moves the FAILED notification earlier, so that if the failure
>> >> > happened during switchover, it'll notify before VM restart.
>> >> >
>> >> 
>> >> The change to FAILED in patch 2 should come to this patch to avoid
>> >> having a window where the notification only happens at the end.
>> >
>> > Hmm.. Isn't that expected?  Even after patch 2, we still notify FAILED at
>> > the end for precopy.  It's the same for postcopy.
>> >
>> 
>> Sorry, I meant: s/at the end/after vm_start/.
>> 
>> > For a failed postcopy we have following behavior:
>> >
>> > Before patch 2
>> > ==============
>> >
>> >   - notify FAILED (during switchover)
>> >   - vm_start()
>> >   - notify FAILED (during migration_cleanup)
>> >
>> > After patch 2 
>> > =============
>> >
>> >   - vm_start()
>> >   - notify FAILED (during migration_cleanup)
>> >
>> > So patch 2 fixes the duplicate issue, and only fixes that.
>> >
>> > After patch 3
>> > =============
>> >
>> >   - notify FAILED (during migration_iteration_finish)
>> >   - vm_start()
>> >
>> > Patch 3 changes the place of FAILED notification so that it happens always
>> > before vm_start(), for both precopy and postcopy.
>> 
>> Right, my point is that with patch 3 we're establishing that the correct
>> place to notify is before vm_start().
>
> Yep, likely not strictly correctness in terms of current notifiers, but
> since Stefan may have yet another use case that may require a notifier to
> be done before vm_start(), it makes more sense for us to move, IMHO.
>
>> But after patch 2, *if* any driver actually depends on being informed of
>> failure *before* starting the VM, that will not happen. I think both
>> changes could be made at once so that this intermediate state never
>> exists.
>
> I see what you meant. I think there should have no such user.
>
> It's because we always notify FAILED at migration_cleanup() for precopy, or
> even postcopy before the cpr-exec work (before QEMU 9.0).
>
> That behavior of "notify FAILED before vm_start() for postcopy" is very
> specific and only added after commit 4af667f87c ("migration: notifier error
> checking").  IOW, before QEMU 9.0, for both precopy and postcopy we always
> notify FAILED in migration_cleanup(), never before vm_start().
>
> I mentioned this in the commit log of previous patch too, where I bet the
> additional FAILED notification added in 4af667f87c for postcopy path is an
> accident (to make it pairing with the "reused DONE", however it turns out
> we likely shouldn't do either of them..).  So I don't expect anything will
> depend on that behavior, and only for postcopy.
>
> The benefit of splitting this patch and previous one is, the previous one
> is a "fix" of duplicated notifications, hence if we need a backport that
> can be done without this one.  Said that, I don't think one should need
> it..  It should also make each commits slightly easier to follow, because
> they're fundamentally two changes.
>
> Let me know what you think after reading my explanations above.  I prefer
> the split like as-is, but I can still squash it to close the trivially
> small window that you described.  I'll make sure if merged the commit
> message contains separate discussions on two problems.

Ok, fair points.

Reviewed-by: Fabiano Rosas <[email protected]>

Reply via email to