On Sun, May 24, 2026 at 09:45:33AM +0300, Avihai Horon wrote:
> 
> On 5/21/2026 6:04 PM, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Thu, May 21, 2026 at 04:46:31PM +0300, Avihai Horon wrote:
> > > On 5/19/2026 10:58 PM, Peter Xu wrote:
> > > > External email: Use caution opening links or attachments
> > > > 
> > > > 
> > > > On Tue, May 05, 2026 at 11:14:18AM +0300, Avihai Horon wrote:
> > > > > When precopy initial_bytes reaches zero 
> > > > > VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
> > > > > flag is sent to the destination to indicate that initial data has been
> > > > > sent, so destination can indicate back to source when it finished
> > > > > loading it.
> > > > > 
> > > > > To get a more accurate estimation of initial_bytes, re-query precopy
> > > > > size before sending the flag. Extract the flag sending logic from
> > > > > vfio_save_iterate() to a new helper for clarity.
> > > > > 
> > > > > This may prevent premature sending of VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
> > > > > flag if, for example, the previously queried initial_bytes was lower
> > > > > than actually is. Additionally, it prevents sending the flag if
> > > > > vfio_query_precopy_size() failed.
> > > > > 
> > > > > Signed-off-by: Avihai Horon <[email protected]>
> > > > > ---
> > > > >    hw/vfio/migration.c  | 37 ++++++++++++++++++++++++++++++++-----
> > > > >    hw/vfio/trace-events |  1 +
> > > > >    2 files changed, 33 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > > > index 2911583ee1..243624b5fe 100644
> > > > > --- a/hw/vfio/migration.c
> > > > > +++ b/hw/vfio/migration.c
> > > > > @@ -456,6 +456,37 @@ static void 
> > > > > vfio_update_estimated_pending_data(VFIOMigration *migration,
> > > > >                                             data_size);
> > > > >    }
> > > > > 
> > > > > +/* Returns true if the init data flag was sent, false otherwise */
> > > > > +static bool vfio_send_init_data_flag(QEMUFile *f, VFIOMigration 
> > > > > *migration)
> > > > > +{
> > > > > +    VFIODevice *vbasedev = migration->vbasedev;
> > > > > +    int ret;
> > > > > +
> > > > > +    if (!migrate_switchover_ack()) {
> > > > > +        return false;
> > > > > +    }
> > > > > +
> > > > > +    if (migration->precopy_init_size || 
> > > > > migration->initial_data_sent) {
> > > > > +        return false;
> > > > > +    }
> > [1]
> > 
> > > > > +
> > > > > +    /*
> > > > > +     * precopy_init_size holds an estimation of the initial data 
> > > > > size, re-query
> > > > > +     * precopy size to ensure it's really zero before sending init 
> > > > > data flag.
> > > > > +     * Don't send the flag if query fails.
> > > > > +     */
> > > > > +    ret = vfio_query_precopy_size(migration);
> > > > > +    if (ret || migration->precopy_init_size) {
> > > > > +        return false;
> > > > > +    }
> > > > IIUC this chunk isn't necessary?  If we don't expect REINIT to happen 
> > > > that
> > > > much (when NIC reconfigures?), then we can still rely on the window 
> > > > where
> > > > the "new switchover ack" will be requested later on during the exact 
> > > > sync.
> > > > 
> > > > Relying on that seems slightly cleaner.
> > > Not sure I follow.
> > > 
> > > New switchover ack is requested in exact sync if we see new init_bytes > 0
> > > (REINIT flag).
> > > This flow happens only after the new switchover ack is requested in exact
> > > sync, when init_bytes = 0 again.
> > > 
> > > So this chunk just makes sure we send the VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
> > > flag at the right time.
> > AFAIU, what this chunk does is, we may save one switchover-ack if REINIT
> > got here.  It doesn't provide much functional difference in reality.
> > 
> > With this code there, when it happens to see REINIT, instead of sending an
> > immediate VFIO_MIG_FLAG_DEV_INIT_DATA_SENT message, it falls back to send
> > init data in the next iteration loop, saving that flag, and saving a
> > "request switchover-ack" on src QEMU too.
> > 
> > If above code removed, IIUC VFIO will send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
> > immediately causing dest sends ACK.  vfio_query_precopy_size() will be
> > postponed until the next sync query (which must happen at some point before
> > final switchover), then it will be collected there, VFIO src will request
> > for switchover-ack, then another VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is
> > expected.
> > 
> > Both should work, but what I meant is, I think we don't need this random
> > check, because it's optimistic, it's not functionally necessary, IIUC.
> > 
> > IOW, see the current code and how it can still race with a REINIT anyway:
> > 
> >               migration thread                                  some vfio 
> > driver thread
> > 
> >      ret = vfio_query_precopy_size(migration);
> >      if (ret || migration->precopy_init_size) {
> >          return false;
> >      }
> >                                                                     got 
> > reconfigured,
> >                                                                     set 
> > REINIT
> > 
> >      qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
> >      migration->initial_data_sent = true;
> >      trace_vfio_send_init_data_flag(vbasedev->name);
> > 
> > It's the same to me if e.g. we try to vfio_query_precopy_size() in VFIO's
> > iterative loops from time to time, it'll also work, it'll make sync more
> > frequent, but it's not needed.
> 
> I see what you mean now.
> However, the purpose of this chunk is not to check for another REINIT, but
> rather to ensure VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is sent at the right time
> -- when init_bytes is truly 0.

IIUC, that part should normally be guaranteed by this line you added prior
to it:

+    if (migration->precopy_init_size || migration->initial_data_sent) {
+        return false;
+    }

Hence when reaching vfio_query_precopy_size(), precopy_init_size==0.

IOW, I think sending INIT_DATA_SENT is fine once it's guarateed the next
vfio_query_precopy_size() will revoke it by the newly populated REINIT
data.  Essentially, the race window is pretty small here.

If you still want to keep it there, I'm still fine; having this extra check
isn't making this incorrect.  I just want to make sure we're on the same
page.

Thanks,

> 
> If the flag is sent before init_bytes = 0 then dest may ack too early,
> before it did the time-consuming load and we risk having that load during
> downtime (see my next response to the cover letter).
> 
> At first I thought mlx5 may report init_bytes estimate which is a bit lower
> than the actual value but then I realized it can't happen. However, I chose
> to keep this patch because it's aligned with the uapi that defines
> init_bytes as an estimate, and because it covers the case where precopy info
> ioctl fails.
> 
> Granted, if initial_bytes is so important I'd expect drivers to report a
> pretty accurate value, but still, it's aligned with the uapi definition and
> doesn't hurt.
> 
> Thanks.
> 

-- 
Peter Xu


Reply via email to