Peter Xu <pet...@redhat.com> writes:

> This patch adds a migration state on src called "postcopy-recover-setup".
> The new state will describe the intermediate step starting from when the
> src QEMU receives a postcopy recovery request, until the migration channels
> are properly established, but before the recovery process takes place.
>
> The request came from Libvirt, which currently relies on the migration
> state events to detect migration state changes.  That works for most of the
> migration process, except for postcopy recovery failures at the beginning.
>
> Currently postcopy recovery only has two major states:
>
>   - postcopy-paused: this is the state that both sides of QEMU stay in
>     for as long as the migration channel is interrupted.
>
>   - postcopy-recover: this is the state where both sides of QEMU handshake
>     with each other, preparing to continue a postcopy migration that was
>     interrupted.
>
> The issue here is that when the recovery port is invalid, the src QEMU will
> take the URI/channels, notice the ports are not valid, and silently stay
> in the postcopy-paused state, with no event sent to Libvirt.  In this case,
> the only thing Libvirt can do is to poll the migration status at a proper
> interval, which is suboptimal.
>
> Considering that this is the only case where Libvirt won't get a
> notification from QEMU on such events, let's add postcopy-recover-setup
> state to mimic what we have with the "setup" state of a newly initialized
> migration, describing the phase of connection establishment.
>
> With that, postcopy recovery now has two paths to take, and either path
> is guaranteed to generate an event.  The events will look like this
> during a recovery process on src QEMU:
>
>   - Initially when the recovery is initiated on src, QEMU will go from
>     "postcopy-paused" -> "postcopy-recover-setup".  Old QEMUs don't have
>     this event.
>
>   - Depending on whether the channel re-establishment succeeds:
>
>     - On success, src QEMU will move from "postcopy-recover-setup"
>       to "postcopy-recover".  Old QEMUs also have this event.
>
>     - On failure, src QEMU will move from "postcopy-recover-setup" to
>       "postcopy-paused" again.  Old QEMUs don't have this event.
>
> This guarantees that Libvirt will always receive a notification for the
> recovery process.
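
The source-side transitions listed above can be sketched as a tiny state
machine. This is only an illustration of the event sequence the commit
message describes, not the code from the patch: the enum values and the
function src_recover_attempt() are hypothetical names, and the real QEMU
identifiers may differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative names only; these are not QEMU's real identifiers. */
typedef enum {
    SRC_POSTCOPY_PAUSED,         /* "postcopy-paused" */
    SRC_POSTCOPY_RECOVER_SETUP,  /* "postcopy-recover-setup" (new state) */
    SRC_POSTCOPY_RECOVER         /* "postcopy-recover" */
} SrcState;

/*
 * Model one recovery attempt on the source side.  Every state change is
 * appended to events[], standing in for a migration state event sent to
 * the management layer.  Returns the number of events emitted, or 0 if
 * the source was not paused or events[] has room for fewer than 2 entries.
 */
static size_t src_recover_attempt(SrcState *state, bool channel_ok,
                                  SrcState events[], size_t cap)
{
    size_t n = 0;

    if (*state != SRC_POSTCOPY_PAUSED || cap < 2) {
        return 0;
    }

    /* Recovery requested: always visible as an event. */
    *state = SRC_POSTCOPY_RECOVER_SETUP;
    events[n++] = *state;

    /* Channel setup either succeeds or drops back to paused. */
    *state = channel_ok ? SRC_POSTCOPY_RECOVER : SRC_POSTCOPY_PAUSED;
    events[n++] = *state;

    return n;
}
```

In this model both outcomes emit two events, so a listener never has to
poll to learn that an attempt failed.
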
>
> One thing to mention is that the new status is only needed on src QEMU,
> not on both sides.  On dest QEMU, the state machine doesn't change, hence
> the events don't change either.  It's done this way because dest QEMU may
> not have an explicit point where setup starts.  E.g., dest QEMU may not
> use the migrate-recover command to switch to a new URI/channel at all: the
> old URI/channels can be reused in recovery, in which case the old ports
> simply work again after the network routes are fixed up.
>
> Add a new helper postcopy_is_paused() that detects whether postcopy is
> still paused, taking RECOVER_SETUP into account too.  When using it on
> both src/dst, a slight change is made at the same time to always wait for
> the semaphore before checking the status, because on both sides a
> sem_post() is required for a recovery.
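
A minimal sketch of the helper and the "wait first, then check" loop
described above, under stated assumptions: the enum and the sketch_*
names are hypothetical, and the semaphore is replaced by an array holding
the status seen after each wakeup so the example stays single-threaded.
The actual patch may be structured differently.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative status values; not QEMU's real identifiers. */
typedef enum {
    ST_POSTCOPY_ACTIVE,
    ST_POSTCOPY_PAUSED,
    ST_POSTCOPY_RECOVER_SETUP,
    ST_POSTCOPY_RECOVER
} Status;

/* Paused means either fully paused or still setting up the recovery. */
static bool sketch_postcopy_is_paused(Status s)
{
    return s == ST_POSTCOPY_PAUSED || s == ST_POSTCOPY_RECOVER_SETUP;
}

/*
 * status_after_wake[i] is the status observed after the i-th wakeup; each
 * array step stands in for one blocking wait on the pause semaphore.
 * The status is checked only after a wakeup, never before the first one.
 * Returns how many wakeups were consumed before the status left the
 * paused states, or 0 if it never did within n wakeups.
 */
static size_t sketch_wait_for_resume(const Status status_after_wake[],
                                     size_t n)
{
    size_t i = 0;

    if (n == 0) {
        return 0;
    }
    do {
        Status s = status_after_wake[i++];  /* "semaphore wait" returned */
        if (!sketch_postcopy_is_paused(s)) {
            return i;
        }
    } while (i < n);

    return 0;
}
```

A wakeup that still finds RECOVER_SETUP or PAUSED simply goes back to
waiting, which is why the predicate has to cover both states.
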
>
> Cc: Jiri Denemark <jdene...@redhat.com>
> Cc: Fabiano Rosas <faro...@suse.de>
> Cc: Prasad Pandit <ppan...@redhat.com>
> Buglink: https://issues.redhat.com/browse/RHEL-38485
> Signed-off-by: Peter Xu <pet...@redhat.com>

Reviewed-by: Fabiano Rosas <faro...@suse.de>
