On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
> On 19.09.2024 23:11, Peter Xu wrote:
> > On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
> > > On 9.09.2024 22:03, Peter Xu wrote:
> > > > On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>
> > > > > 
> > > > > load_finish SaveVMHandler allows migration code to poll whether
> > > > > a device-specific asynchronous device state loading operation has finished.
> > > > > 
> > > > > In order to avoid calling this handler needlessly the device is supposed
> > > > > to notify the migration code of its possible readiness via a call to
> > > > > qemu_loadvm_load_finish_ready_broadcast() while holding
> > > > > qemu_loadvm_load_finish_ready_lock.
> > > > > 
> > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigi...@oracle.com>
> > > > > ---
> > > > >    include/migration/register.h | 21 +++++++++++++++
> > > > >    migration/migration.c        |  6 +++++
> > > > >    migration/migration.h        |  3 +++
> > > > >    migration/savevm.c           | 52 ++++++++++++++++++++++++++++++++++++
> > > > >    migration/savevm.h           |  4 +++
> > > > >    5 files changed, 86 insertions(+)
> > > > > 
> > > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > > index 4a578f140713..44d8cf5192ae 100644
> > > > > --- a/include/migration/register.h
> > > > > +++ b/include/migration/register.h
> > > > > @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> > > > >        int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> > > > >                                 Error **errp);
> > > > > +    /**
> > > > > +     * @load_finish
> > > > > +     *
> > > > > +     * Poll whether all asynchronous device state loading has finished.
> > > > > +     * Not called on the load failure path.
> > > > > +     *
> > > > > +     * Called while holding the qemu_loadvm_load_finish_ready_lock.
> > > > > +     *
> > > > > +     * If this method signals "not ready" then it might not be called
> > > > > +     * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> > > > > +     * while holding qemu_loadvm_load_finish_ready_lock.
> > > > 
> > > > [1]
> > > > 
> > > > > +     *
> > > > > +     * @opaque: data pointer passed to register_savevm_live()
> > > > > +     * @is_finished: whether the loading has finished (output parameter)
> > > > > +     * @errp: pointer to Error*, to store an error if it happens.
> > > > > +     *
> > > > > +     * Returns zero to indicate success and negative for error.
> > > > > +     * It's not an error that the loading still hasn't finished.
> > > > > +     */
> > > > > +    int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> > > > 
> > > > The load_finish() semantics are a bit weird, especially above [1] on
> > > > "only allowed to be called once if ..." and also on the locks.
> > > 
> > > The point of this remark is that a driver needs to call
> > > qemu_loadvm_load_finish_ready_broadcast() if it wants for the migration
> > > core to call its load_finish handler again.
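For illustration, the notify-then-poll contract described here might look roughly like the sketch below. This is hypothetical code, not QEMU source: plain pthreads stand in for QEMU's QemuMutex/QemuCond, and names like `device_loader_thread` and `core_wait_load_finish` are made up. The key point is that the core only re-polls the handler after a broadcast, instead of busy-looping.

```c
/*
 * Hypothetical sketch of the notify-then-poll contract: the core sleeps
 * on a condition variable between load_finish polls; the device's
 * loading thread broadcasts (while holding the lock) when its readiness
 * may have changed.  Not QEMU code; all names are illustrative.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t load_finish_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t load_finish_cond = PTHREAD_COND_INITIALIZER;
static bool device_loaded; /* would be per-device state in real code */

/* Device side: finish async loading, then broadcast under the lock */
static void *device_loader_thread(void *arg)
{
    usleep(10 * 1000); /* stand-in for the actual buffer/state loading */
    pthread_mutex_lock(&load_finish_lock);
    device_loaded = true;
    /* analogous to qemu_loadvm_load_finish_ready_broadcast() */
    pthread_cond_broadcast(&load_finish_cond);
    pthread_mutex_unlock(&load_finish_lock);
    return NULL;
}

/* Core side: re-poll only after a broadcast, never busy-loop */
static bool core_wait_load_finish(void)
{
    pthread_mutex_lock(&load_finish_lock);
    while (!device_loaded) { /* i.e. load_finish() said "not ready" */
        pthread_cond_wait(&load_finish_cond, &load_finish_lock);
    }
    pthread_mutex_unlock(&load_finish_lock);
    return true;
}
```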
> > > 
> > > > It looks to me like vfio_load_finish() also does the final load of the device.
> > > > 
> > > > I wonder whether that final load can be done in the threads,
> > > 
> > > Here, the problem is that the current VFIO VMState has to be loaded from
> > > the main migration thread, as it internally calls QEMU core address space
> > > modification methods which explode if called from other threads.
> > 
> > Ahh, I see.  I'm trying to make dest QEMU do loadvm in a thread too and
> > yield the BQL if possible; when that's ready, then in your case here IIUC
> > you could simply take the BQL in whichever thread loads it.. but yeah,
> > it's not ready yet..
> 
> Yeah, long term we might want to work on making these QEMU core address space
> modification methods somehow callable from multiple threads but that's
> definitely not something for the initial patch set.
> 
> > Would it be possible for vfio_save_complete_precopy_async_thread_config_state()
> > to be done in VFIO's save_live_complete_precopy() through the main channel
> > somehow?  IOW, does it rely on iterative data being fetched first from the
> > kernel, or is it a completely separate state?
> 
> The device state data needs to be fully loaded first before "activating"
> the device by loading its config state.
> 
> > And just curious: how large is it
> > normally (and I suppose this decides whether it's applicable to be sent via
> > the main channel at all..)?
> 
> Config data is *much* smaller than device state data - as far as I remember,
> it was on the order of kilobytes.
> 
> > > 
> > > > then after
> > > > everything loaded the device post a semaphore telling the main thread to
> > > > continue.  See e.g.:
> > > > 
> > > >       if (migrate_switchover_ack()) {
> > > >           qemu_loadvm_state_switchover_ack_needed(mis);
> > > >       }
> > > > 
> > > > IIUC, VFIO can register load_complete_ack similarly so it only sem_post()
> > > > when all things are loaded?  We can then get rid of this slightly awkward
> > > > interface.  I had a feeling that things can be simplified (e.g., if the
> > > > thread will take care of loading the final vmstate then the mutex is also
> > > > not needed? etc.).
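For illustration, the semaphore-based alternative being suggested might look roughly like this. It is a hypothetical sketch, not QEMU code: POSIX semaphores stand in for QEMU's QemuSemaphore, `vfio_loader_thread` and `wait_all_vfio_loads` are made-up names, and the pattern mirrors the quoted switchover-ack example, with each device posting only once it has fully loaded.

```c
/*
 * Hypothetical sketch of the semaphore-based alternative: each device's
 * loading thread sem_post()s once everything is loaded, and the main
 * migration thread just waits per device.  Not QEMU code; POSIX
 * semaphores stand in for QemuSemaphore and all names are illustrative.
 */
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

enum { NUM_VFIO_DEVICES = 2 };

static sem_t load_complete_sem[NUM_VFIO_DEVICES];

/* Device side: post only when all device state has been loaded */
static void *vfio_loader_thread(void *arg)
{
    int idx = *(int *)arg;
    usleep((idx + 1) * 5000); /* stand-in for loading buffers + config */
    sem_post(&load_complete_sem[idx]);
    return NULL;
}

/* Main migration thread: block until every device reports completion */
static int wait_all_vfio_loads(void)
{
    for (int i = 0; i < NUM_VFIO_DEVICES; i++) {
        sem_wait(&load_complete_sem[i]);
    }
    return 0;
}
```

Note this is exactly the blocking-wait structure the reply below objects to: the waits happen in array order, regardless of which device finishes first.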
> > > 
> > > With just a single call to switchover_ack_needed per VFIO device it would
> > > need to do a blocking wait for the device buffers and config state load
> > > to finish, therefore blocking other VFIO devices from potentially loading
> > > their config state if they are ready to begin this operation earlier.
> > 
> > I am not sure I get you here: loading VFIO device states (I mean, the
> > non-iterable part) will need to be done sequentially IIUC due to what you
> > said, and should rely on the BQL, so I don't know how that could happen
> > concurrently for now.  But I think indeed the BQL is a problem.
> 
> Consider that we have two VFIO devices (A and B), with the following order
> of switchover_ack_needed handler calls for them: first A gets this call, and
> only once the call for A finishes does B get it.
> 
> Now consider what happens if B had loaded all its buffers (in the loading
> thread) and it is ready for its config load before A finished loading its
> buffers.
> 
> B has to sit idle in this situation (even though it could already have been
> loading its config) since the switchover_ack_needed handler for A won't
> return until A is fully done.

This sounds like a performance concern, and I wonder how much it impacts a
real workload (it would be good if you could run a test and measure it, with
and without such concurrency) when we can save two devices in parallel anyway;
I would expect the real difference to be small, due to the fact I mentioned
that we save more than one VFIO device concurrently via multifd.

Do you think we can start with a simpler approach?

So what I'm thinking could be very clean is this: we just discussed
MIG_CMD_SWITCHOVER, and it looks like you also think it's an OK approach.
With that in place, why not move one step further and have
MIG_CMD_SEND_NON_ITERABLE just to mark "iterable devices all done, ready to
send non-iterable"?  It can be controlled by the same migration property so
we only send these two flags on 9.2+ machine types.

Then IIUC VFIO can send its config data through the main wire (just like most
other PCI devices! which is IMHO a good fit..) and on the destination VFIO
holds off loading it until passing the MIG_CMD_SEND_NON_ITERABLE phase.
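To sketch what "holding off" could mean on the destination, here is a hypothetical, non-QEMU illustration: config blobs arriving before the proposed marker are queued, and the marker flips a phase flag that flushes the queue. All names (`vfio_receive_config`, `handle_non_iterable_marker`, the counters) are made up for the example.

```c
/*
 * Hypothetical sketch of deferring VFIO config-state load until the
 * proposed MIG_CMD_SEND_NON_ITERABLE marker has been seen on the
 * stream.  Not QEMU code; all names and the queue are illustrative.
 */
#include <assert.h>
#include <stdbool.h>

enum { MAX_PENDING = 8 };

static const void *pending_cfg[MAX_PENDING];
static int num_pending;
static bool past_non_iterable_marker;
static int num_applied; /* stand-in for actually loading config state */

/* Destination side: queue early config instead of applying immediately */
static void vfio_receive_config(const void *blob)
{
    if (past_non_iterable_marker) {
        num_applied++;          /* marker already seen: apply right away */
    } else {
        pending_cfg[num_pending++] = blob;
    }
}

/* Called when the stream reaches the non-iterable marker */
static void handle_non_iterable_marker(void)
{
    past_non_iterable_marker = true;
    num_applied += num_pending; /* flush everything queued so far */
    num_pending = 0;
}
```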

Side note: when looking again, I really think we should clean up some of the
migration switchover-phase functions.  E.g., I think
qemu_savevm_state_complete_precopy()'s parameters are pretty confusing,
especially iterable_only, and it also has some implicit postcopy checks
inside, urgh.. but this is not relevant to our discussion, and I won't draft
that before your series lands; that could complicate stuff.

> 
> > So IMHO this recv-side interface is so far the major pain (compared to the
> > rest) in the series that I really want to avoid.  Let's see whether we can
> > come up with something better..
> > 
> > One other (probably not pretty..) idea is when waiting here in the main
> > thread it yields BQL, then other threads can take it and load the VFIO
> > final chunk of data.  But I could miss something else.
> > 
> 
> I think temporarily dropping the BQL deep inside migration code is similar
> to running the QEMU event loop deep inside migration code (which people
> complained about in my generic thread pool implementation): it's easy to
> miss some subtle dependency/race somewhere and accidentally cause a rare,
> hard-to-debug deadlock.
> 
> That's why I think it's ultimately probably better to make the QEMU core
> address space modification methods thread-safe / re-entrant instead.

Right, let's see how you think about above.

Thanks,

-- 
Peter Xu

