On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
> On 19.09.2024 23:11, Peter Xu wrote:
> > On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
> > > On 9.09.2024 22:03, Peter Xu wrote:
> > > > On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>
> > > > >
> > > > > load_finish SaveVMHandler allows migration code to poll whether
> > > > > a device-specific asynchronous device state loading operation had
> > > > > finished.
> > > > >
> > > > > In order to avoid calling this handler needlessly the device is
> > > > > supposed to notify the migration code of its possible readiness
> > > > > via a call to qemu_loadvm_load_finish_ready_broadcast() while
> > > > > holding qemu_loadvm_load_finish_ready_lock.
> > > > >
> > > > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigi...@oracle.com>
> > > > > ---
> > > > >  include/migration/register.h | 21 +++++++++++++++
> > > > >  migration/migration.c        |  6 +++++
> > > > >  migration/migration.h        |  3 +++
> > > > >  migration/savevm.c           | 52 ++++++++++++++++++++++++++++++++++++
> > > > >  migration/savevm.h           |  4 +++
> > > > >  5 files changed, 86 insertions(+)
> > > > >
> > > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > > index 4a578f140713..44d8cf5192ae 100644
> > > > > --- a/include/migration/register.h
> > > > > +++ b/include/migration/register.h
> > > > > @@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
> > > > >      int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
> > > > >                               Error **errp);
> > > > > +    /**
> > > > > +     * @load_finish
> > > > > +     *
> > > > > +     * Poll whether all asynchronous device state loading had finished.
> > > > > +     * Not called on the load failure path.
> > > > > +     *
> > > > > +     * Called while holding the qemu_loadvm_load_finish_ready_lock.
> > > > > +     *
> > > > > +     * If this method signals "not ready" then it might not be called
> > > > > +     * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
> > > > > +     * while holding qemu_loadvm_load_finish_ready_lock.
> > > >
> > > > [1]
> > > >
> > > > > +     *
> > > > > +     * @opaque: data pointer passed to register_savevm_live()
> > > > > +     * @is_finished: whether the loading had finished (output parameter)
> > > > > +     * @errp: pointer to Error*, to store an error if it happens.
> > > > > +     *
> > > > > +     * Returns zero to indicate success and negative for error.
> > > > > +     * It's not an error that the loading still hasn't finished.
> > > > > +     */
> > > > > +    int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
> > > >
> > > > The load_finish() semantics is a bit weird, especially above [1] on "only
> > > > allowed to be called once if ..." and also on the locks.
> > >
> > > The point of this remark is that a driver needs to call
> > > qemu_loadvm_load_finish_ready_broadcast() if it wants the migration
> > > core to call its load_finish handler again.
> > >
> > > > It looks to me vfio_load_finish() also does the final load of the
> > > > device.
> > > >
> > > > I wonder whether that final load can be done in the threads,
> > >
> > > Here, the problem is that the current VFIO VMState has to be loaded from
> > > the main migration thread as it internally calls QEMU core address space
> > > modification methods which explode if called from other threads.
> >
> > Ahh, I see.  I'm trying to make dest qemu loadvm in a thread too and yield
> > BQL if possible; when that's ready then in your case here IIUC you can
> > simply take BQL in whichever thread that loads it.. but yeah it's not ready
> > at least..
>
> Yeah, long term we might want to work on making these QEMU core address
> space modification methods somehow callable from multiple threads, but
> that's definitely not something for the initial patch set.
>
> > Would it be possible for vfio_save_complete_precopy_async_thread_config_state()
> > to be done in VFIO's save_live_complete_precopy() through the main channel
> > somehow?  IOW, does it rely on iterative data to be fetched first from the
> > kernel, or completely separate states?
>
> The device state data needs to be fully loaded first before "activating"
> the device by loading its config state.
>
> > And just curious: how large is it
> > normally (and I suppose this decides whether it's applicable to be sent via
> > the main channel at all..)?
>
> Config data is *much* smaller than device state data - as far as I remember
> it was on the order of kilobytes.
>
> > > > then after
> > > > everything is loaded the device posts a semaphore telling the main
> > > > thread to continue.  See e.g.:
> > > >
> > > >     if (migrate_switchover_ack()) {
> > > >         qemu_loadvm_state_switchover_ack_needed(mis);
> > > >     }
> > > >
> > > > IIUC, VFIO can register load_complete_ack similarly so it only
> > > > sem_post()s when all things are loaded?  We can then get rid of this
> > > > slightly awkward interface.  I had a feeling that things can be
> > > > simplified (e.g., if the thread will take care of loading the final
> > > > vmstate then the mutex is also not needed? etc.).
> > >
> > > With just a single call to switchover_ack_needed per VFIO device it would
> > > need to do a blocking wait for the device buffers and config state load
> > > to finish, therefore blocking other VFIO devices from potentially loading
> > > their config state if they are ready to begin this operation earlier.
> >
> > I am not sure I get you here, loading VFIO device states (I mean, the
> > non-iterable part) will need to be done sequentially IIUC due to what you
> > said and should rely on BQL, so I don't know how that could happen
> > concurrently for now.  But I think indeed BQL is a problem.
>
> Consider that we have two VFIO devices (A and B), with the following order
> of switchover_ack_needed handler calls for them: first A gets this call,
> and once the call for A finishes then B gets it.
>
> Now consider what happens if B had loaded all its buffers (in the loading
> thread) and is ready for its config load before A finished loading its
> buffers.
>
> B has to wait idle in this situation (even though it could have already been
> loading its config) since the switchover_ack_needed handler for A won't
> return until A is fully done.
This sounds like a performance concern, and I wonder how much this impacts
the real workload (that you run a test and measure, with/without such
concurrency) when we can save two devices in parallel anyway; I would expect
the real diff is small due to the fact I mentioned that we save >1 VFIO
devices concurrently via multifd.

Do you think we can start with a simpler approach?

What I'm thinking could be very clean is this: we just discussed
MIG_CMD_SWITCHOVER and it looks like you also think it's an OK approach.  I
wonder, with that in place, why don't we move one step further and have a
MIG_CMD_SEND_NON_ITERABLE just to mark that "iterable devices are all done,
ready to send non-iterable".  It can be controlled by the same migration
property so we only send these two flags in 9.2+ machine types.

Then IIUC VFIO can send config data through the main wire (just like most
other PCI devices, which is IMHO a good fit..) and on the destination VFIO
holds off loading it until passing the MIG_CMD_SEND_NON_ITERABLE phase.

Side note: when looking again, I really think we should clean up some of the
migration switchover phase functions, e.g. I think the
qemu_savevm_state_complete_precopy() parameters are pretty confusing,
especially iterable_only; it also has some implicit postcopy checks inside,
urgh.. but this is not relevant to our discussion, and I won't draft that
before your series lands; that could complicate stuff.

> >
> > So IMHO this recv side interface so far is the major pain that I really
> > want to avoid (comparing to the rest) in the series.  Let's see whether we
> > can come up with something better..
> >
> > One other (probably not pretty..) idea is when waiting here in the main
> > thread it yields BQL, then other threads can take it and load the VFIO
> > final chunk of data.  But I could miss something else.
> >
>
> I think temporarily dropping BQL deep inside migration code is similar
> to running the QEMU event loop deep inside migration code (about which
> people complained in my generic thread pool implementation): it's easy
> to miss some subtle dependency/race somewhere and accidentally cause a
> rare, hard-to-debug deadlock.
>
> That's why I think that it's ultimately probably better to make the QEMU
> core address space modification methods thread safe / re-entrant instead.

Right, let's see what you think about the above.

Thanks,

-- 
Peter Xu