On 5/21/2026 6:04 PM, Peter Xu wrote:
External email: Use caution opening links or attachments


On Thu, May 21, 2026 at 04:46:31PM +0300, Avihai Horon wrote:
On 5/19/2026 10:58 PM, Peter Xu wrote:
External email: Use caution opening links or attachments


On Tue, May 05, 2026 at 11:14:18AM +0300, Avihai Horon wrote:
When precopy initial_bytes reaches zero VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
flag is sent to the destination to indicate that initial data has been
sent, so destination can indicate back to source when it finished
loading it.

To get a more accurate estimation of initial_bytes, re-query precopy
size before sending the flag. Extract the flag sending logic from
vfio_save_iterate() to a new helper for clarity.

This may prevent premature sending of VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
flag if, for example, the previously queried initial_bytes was lower
than actually is. Additionally, it prevents sending the flag if
vfio_query_precopy_size() failed.

Signed-off-by: Avihai Horon <[email protected]>
---
   hw/vfio/migration.c  | 37 ++++++++++++++++++++++++++++++++-----
   hw/vfio/trace-events |  1 +
   2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 2911583ee1..243624b5fe 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -456,6 +456,37 @@ static void 
vfio_update_estimated_pending_data(VFIOMigration *migration,
                                            data_size);
   }

+/* Returns true if the init data flag was sent, false otherwise */
+static bool vfio_send_init_data_flag(QEMUFile *f, VFIOMigration *migration)
+{
+    VFIODevice *vbasedev = migration->vbasedev;
+    int ret;
+
+    if (!migrate_switchover_ack()) {
+        return false;
+    }
+
+    if (migration->precopy_init_size || migration->initial_data_sent) {
+        return false;
+    }
[1]

+
+    /*
+     * precopy_init_size holds an estimation of the initial data size, re-query
+     * precopy size to ensure it's really zero before sending init data flag.
+     * Don't send the flag if query fails.
+     */
+    ret = vfio_query_precopy_size(migration);
+    if (ret || migration->precopy_init_size) {
+        return false;
+    }
IIUC this chunk isn't necessary?  If we don't expect REINIT to happen that
much (when NIC reconfigures?), then we can still rely on the window where
the "new switchover ack" will be requested later on during the exact sync.

Relying on that seems slightly cleaner.
Not sure I follow.

New switchover ack is requested in exact sync if we see new init_bytes > 0
(REINIT flag).
This flow happens only after the new switchover ack is requested in exact
sync, when init_bytes = 0 again.

So this chunk just makes sure we send the VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
flag at the right time.
AFAIU, what this chunk does is, we may save one switchover-ack if REINIT
got here.  It doesn't provide much functional difference in reality.

With this code there, when it happens to see REINIT, instead of sending an
immediate VFIO_MIG_FLAG_DEV_INIT_DATA_SENT message, it falls back to send
init data in the next iteration loop, saving that flag, and saving a
"request switchover-ack" on src QEMU too.

If above code removed, IIUC VFIO will send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
immediately causing dest sends ACK.  vfio_query_precopy_size() will be
postponed until the next sync query (which must happen at some point before
final switchover), then it will be collected there, VFIO src will request
for switchover-ack, then another VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is
expected.

Both should work, but what I meant is, I think we don't need this random
check, because it's optimistic, it's not functionally necessary, IIUC.

IOW, see the current code and how it can still race with a REINIT anyway:

              migration thread                                  some vfio 
driver thread

     ret = vfio_query_precopy_size(migration);
     if (ret || migration->precopy_init_size) {
         return false;
     }
                                                                    got 
reconfigured,
                                                                    set REINIT

     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
     migration->initial_data_sent = true;
     trace_vfio_send_init_data_flag(vbasedev->name);

It's the same to me if e.g. we try to vfio_query_precopy_size() in VFIO's
iterative loops from time to time, it'll also work, it'll make sync more
frequent, but it's not needed.

I see what you mean now.
However, the purpose of this chunk is not to check for another REINIT, but rather to ensure VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is sent at the right time -- when init_bytes is truly 0.

If the flag is sent before init_bytes = 0 then dest may ack too early, before it did the time-consuming load and we risk having that load during downtime (see my next response to the cover letter).

At first I thought mlx5 may report init_bytes estimate which is a bit lower than the actual value but then I realized it can't happen. However, I chose to keep this patch because it's aligned with the uapi that defines init_bytes as an estimate, and because it covers the case where precopy info ioctl fails.

Granted, if initial_bytes is so important I'd expect drivers to report a pretty accurate value, but still, it's aligned with the uapi definition and doesn't hurt.

Thanks.


Reply via email to