On 03/05/2023 1:49, Peter Xu wrote:

On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote:
Hello everyone,
Hi, Avihai,

=== Flow of operation ===

To use precopy initial data, the capability must be enabled in the
source.

As this capability must be supported also in the destination, a
handshake is performed during migration setup. The purpose of the
handshake is to notify the destination that precopy initial data is used
and to check if it's supported.

The handshake is done in two levels. First, a general handshake is done
with the destination migration code to notify that precopy initial data
is used. Then, for each migration user in the source that supports
precopy initial data, a handshake is done with its counterpart in the
destination:
If both support it, precopy initial data will be used for them.
If source doesn't support it, precopy initial data will not be used for
them.
If source supports it and destination doesn't, migration will fail.
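
The three outcomes of the per-user handshake above can be written as a small decision function. This is only a sketch for illustration: the enum and the function name are hypothetical and are not taken from the patch series.

```c
#include <stdbool.h>

/* Hypothetical names; they only mirror the three cases listed above. */
typedef enum {
    INITIAL_DATA_USED,     /* both source and destination support it */
    INITIAL_DATA_NOT_USED, /* source does not support it */
    INITIAL_DATA_FAIL      /* source supports it, destination does not */
} InitialDataOutcome;

/* Decide the handshake result for one migration user. */
static InitialDataOutcome initial_data_handshake(bool src_supports,
                                                 bool dst_supports)
{
    if (!src_supports) {
        /* The source never asks for it, so the destination is not consulted. */
        return INITIAL_DATA_NOT_USED;
    }
    return dst_supports ? INITIAL_DATA_USED : INITIAL_DATA_FAIL;
}
```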

Assuming the handshake succeeded, migration starts to send precopy data
and as part of it also the initial precopy data. Initial precopy data is
just like any other precopy data and as such, migration code is not
aware of it. Therefore, it's the responsibility of the migration users
(such as VFIO devices) to notify their counterparts in the destination
that their initial precopy data has been sent (for example, VFIO
migration does it when its initial bytes reach zero).

In the destination, migration code will query each migration user that
supports precopy initial data and check if its initial data has been
loaded. If initial data has been loaded by all of them, an ACK will be
sent to the source which will now be able to complete migration when
appropriate.
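
A rough sketch of the destination-side check described above, assuming a hypothetical per-user record with two flags. The struct, its fields and the function name are made up for illustration and do not come from the series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-user state on the destination. */
typedef struct {
    bool uses_initial_data;   /* handshake agreed on precopy initial data */
    bool initial_data_loaded; /* this user's initial data has been loaded */
} MigUser;

/*
 * Return true when every user that uses precopy initial data has loaded it,
 * i.e. when the destination may send the ACK to the source.
 * With no such users the result is vacuously true; what the real code does
 * in that case is not stated in this thread.
 */
static bool initial_data_all_loaded(const MigUser *users, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (users[i].uses_initial_data && !users[i].initial_data_loaded) {
            return false;
        }
    }
    return true;
}
```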
I can understand why this is useful; what I'm not 100% sure about is whether
the complexity is needed.  The idea seems to be that src never switches over
unless it receives a READY notification from dst.

I'm imagining the simplified and more general workflow below, not sure
whether it could work for you:

   - Introduce a new cap "switchover-ready", it means whether there'll be a
     ready event sent from dst -> src for "being ready for switchover"

   - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and
     handled on src showing that dest is ready for switchover. It'll be sent
     only if dest is ready for the switchover

   - Introduce a field SaveVMHandlers.explicit_switchover_needed.  For each
     special device like vfio that would like to participate in the decision
     making, device can set its explicit_switchover_needed=1.  This field is
     ignored if the new cap is not set.

   - Dst qemu: when new cap set, remember how many special devices are
     requesting explicit switchover (count of SaveVMHandlers that have the
     bit set during load setup) as switch_over_pending=N.

   - Dst qemu: Once a device thinks it's fine to switch over (probably in the
     load_state() callback), it calls migration_notify_switchover_ready().
     That decreases switch_over_pending and when it hits zero, one msg
     MIG_RP_MSG_SWITCHOVER_READY will be sent to src.

Only once the READY msg is received on src can src switch over the precopy
to dst.
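
The destination side of the workflow above could look roughly like the sketch below. The names switch_over_pending, explicit_switchover_needed, migration_notify_switchover_ready() and MIG_RP_MSG_SWITCHOVER_READY come from the proposal. DstState, dst_load_setup(), the function signatures and the ready_msg_sent flag (which stands in for actually sending the message) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical destination-side state, not an existing QEMU structure. */
typedef struct {
    bool cap_switchover_ready; /* the proposed "switchover-ready" cap */
    int switch_over_pending;   /* handlers that still have to report ready */
    bool ready_msg_sent;       /* stands in for MIG_RP_MSG_SWITCHOVER_READY */
} DstState;

/* Load setup: count the handlers that set explicit_switchover_needed. */
static void dst_load_setup(DstState *s, const bool *explicit_switchover_needed,
                           size_t n_handlers)
{
    s->switch_over_pending = 0;
    s->ready_msg_sent = false;
    if (!s->cap_switchover_ready) {
        return; /* the field is ignored when the cap is not set */
    }
    for (size_t i = 0; i < n_handlers; i++) {
        if (explicit_switchover_needed[i]) {
            s->switch_over_pending++;
        }
    }
}

/* Called by a device once it considers switchover to be fine. */
static void migration_notify_switchover_ready(DstState *s)
{
    if (!s->cap_switchover_ready || s->switch_over_pending == 0) {
        return;
    }
    if (--s->switch_over_pending == 0) {
        s->ready_msg_sent = true; /* one READY msg, dst -> src */
    }
}
```

The sketch sends nothing when N is zero at load setup; the proposal does not say how that case should be handled.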

Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
more msg (dst->src).

This is based on the fact that right now we always set caps on both qemus
so I suppose it already means either both have or don't have the feature
(even if one has, not setting the cap means disabled on both).

Would it work for this case, and be cleaner?

Hi Peter, thanks for the response!
Your approach is indeed much simpler, however I have a few concerns regarding compatibility.

You are saying that caps are always set both in src and dest.
But what happens if we set the cap only on one side?
Should we care about these scenarios?
For example, if we set the cap only in src, then src will wait indefinitely for dest to notify that switchover is ready. Would you expect migration to fail instead of running indefinitely? In the current approach we only need to enable the cap in the source, so such a scenario can't happen.

Let's look at some other scenario.
Src QEMU supports explicit-switchover for device X but *not* for device Y (i.e., src QEMU is some older version of QEMU that supports explicit-switchover for device X but not for Y).
Dest QEMU supports explicit-switchover for device X and device Y.
The capability is set in both src and dest.
In the destination we will have switch_over_pending=2 because both X and Y support explicit-switchover. We do migration, but switch_over_pending will never reach 0 because only X supports it in the source, so the migration will run indefinitely. The per-device handshake solves this by making device Y not use explicit-switchover in this case.
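
To make the X/Y scenario concrete, here is a small sketch that compares the pending count computed from destination support alone with the count after a per-device handshake. All names are hypothetical. It assumes a device reports ready on the destination only if the source also supports the feature for it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device support flags on each side. */
typedef struct {
    bool src_supports;
    bool dst_supports;
} DeviceSupport;

/* Pending count when dst only looks at its own handlers (no handshake). */
static int count_dst_only(const DeviceSupport *d, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        count += d[i].dst_supports ? 1 : 0;
    }
    return count;
}

/*
 * Devices supported on both sides: the pending count after a per-device
 * handshake, and also the number of ready notifications that can arrive.
 */
static int count_both(const DeviceSupport *d, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (d[i].src_supports && d[i].dst_supports) ? 1 : 0;
    }
    return count;
}

/* Pending count left after all possible notifications have arrived. */
static int leftover(int pending, int notifications)
{
    return pending - notifications;
}
```

With X supported on both sides and Y only on the destination, the count without a handshake stays at 1 forever, while the count after the handshake drops to 0.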

Thanks.

