* Juraj Marcin (jmar...@redhat.com) wrote:
> Hi Dave,
>
> On 2025-09-01 17:57, Dr. David Alan Gilbert wrote:
> > * Peter Xu (pet...@redhat.com) wrote:
> > > On Thu, Aug 14, 2025 at 05:42:23PM +0200, Juraj Marcin wrote:
> > > > Fair point, I'll then continue with the PING/PONG solution, the first
> > > > implementation I have seems to be working to resolve Issue 1.
> > > >
> > > > For rarer split brain, we'll rely on block device locks/mgmt to resolve
> > > > and change the failure handling, so it registers errors from disk
> > > > activation.
> > > >
> > > > As tested, there should be no problems with the destination
> > > > transitioning to POSTCOPY_PAUSED, since the VM was not started yet.
> > > >
> > > > However, to prevent the source side from transitioning to
> > > > POSTCOPY_PAUSED, I think adding a new state is still the best option.
> > > >
> > > > I tried keeping the migration states as they are now and just rely on an
> > > > attribute of MigrationState if 3rd PONG was received, however, this
> > > > collides with (at least) migrate_pause tests, that are waiting for
> > > > POSTCOPY_ACTIVE, and then pause the migration triggering the source to
> > > > resume. We could maybe work around it by waiting for the 3rd pong
> > > > instead, but I am not sure if it is possible from tests, or by not
> > > > resuming if migrate_pause command is executed?
> > > >
> > > > I also tried extending the span of the DEVICE state, but some functions
> > > > behave differently depending on if they are in postcopy or not, using
> > > > the migration_in_postcopy() function, but adding the DEVICE there isn't
> > > > working either. And treating the DEVICE state sometimes as postcopy and
> > > > sometimes as not seems just too messy, if it would even be possible.
> > >
> > > Yeah, it might indeed be a bit messy.
> > >
> > > Is it possible to find a middle ground? E.g. add postcopy-setup status,
> > > but without any new knob to enable it?
> > > Just to describe the period of time where dest QEMU hasn't started
> > > running but started loading device states.
> > >
> > > The hope is libvirt (which, AFAIU, always enables the "events" capability)
> > > can ignore the new postcopy-setup status transition, then maybe we can also
> > > introduce the postcopy-setup and make it always appear.
> >
> > When the destination is started with '-S' (autostart=false), which is what
> > I think libvirt does, doesn't management only start the destination
> > after a certain useful event?
> > In other words, is there an event we already emit to say that the destination
> > has finished loading the postcopy devices, or could we just add that
> > event, so that management could just wait for that before issuing
> > the continue?
>
> I am not aware of any such event on the destination side. When postcopy
> (and its switchover) starts, the destination transitions from ACTIVE
> directly to POSTCOPY_ACTIVE in the listen thread while devices are
> loaded concurrently by the main thread.
>
> There is DEVICE state on the source side, but that is used only on the
> source side when device state is being collected. When device state is
> being loaded on the destination, the source side is also already in
> POSTCOPY_ACTIVE state.

So I wonder what libvirt uses to trigger it starting the destination in the
postcopy case? It's got to be after the device state has loaded.

Dave

> Best regards,
>
> Juraj Marcin
>
> >
> > Dave
> >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> > >
> > >
> > --
> >  -----Open up your eyes, open up your mind, open up your code -------
> > / Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \
> > \        dave @ treblig.org |                               | In Hex /
> >  \ _________________________|_____ http://www.treblig.org   |_______/
> >
> >
--
 -----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/