migration: Parallelize device state transitions

Maciej S. Szmigiero Wed, 03 Jun 2026 14:36:36 -0700

On 2.06.2026 22:17, Peter Xu wrote:

On Tue, Jun 02, 2026 at 12:01:32PM +0200, Maciej S. Szmigiero wrote:

On 1.06.2026 21:33, Peter Xu wrote:

On Mon, Jun 01, 2026 at 10:39:26AM -0300, Fabiano Rosas wrote:

A bit idiosyncratic, but I don't have any better suggestions.


One thing I don't think proper with the current approach: if it's a real
"priority" it should be invoked always with high->low order, no matter if
running=true or not..  but that's not what will happen if we encode it with
the "depth" field.

That is an implication that we have two demands, and they aren't the same,
but this solution wrongly packed them together.

In 2023, VFIO already introduced VMChangeStateEntry.prepare_cb().  That is
a real impl of prority queues with only two: prepare_cb() always has higher
priority than the cb() itself.

Can VFIO simply leverage this existing interface, instead of hijacking
"depth" in an unwanted way?

I believe the _P2P states still need to happen first for all vfio devices,
then it can do the next step. Logically I think it can also be done
internally within VFIO by proper impl of these two callbacks, to achieve
both:

- Proper transition to P2P states first for all devices, meanwhile,
- Proper concurrency of all VFIO devices on device state transitions

Would that work in a cleaner way?


In the VFIO migration save/source code we have two transitions:
PRE_COPY -> PRE_COPY_P2P done via prepare_cb() and
PRE_COPY_P2P -> STOP_COPY done via cb().
On the load/target side the situation is similar, only states are
different.

For each of these transitions we need two callbacks: one has to launch
per-VFIO device state change thread and the other one has to "collect"
(or join) these threads.

All handlers of the first callback have to be complete before calling
any handler of the second callback since otherwise there would be
VFIO device vs VFIO device serialization.

Moreover, all the prepare_cb()-related transitions have to be done
before any of the cb()-related is started.


I presume what you suggest above is to launch per-VFIO device state
transition thread in prepare_cb() which would then be responsible for
doing *both* state transitions for this VFIO device.
And then collect this thread in cb().


Yes.


However, this would need some extra interlocking:
- A lock inside VFIO code shared by all VFIO devices making sure
no VFIO device starts the second transition (the cb() one) before
all VFIO devices finish the first transition (the prepare_cb() one).


Yes, more below.


- Looking at the current QEMU code, there are some non-VFIO cb()
handlers registered via qdev_add_vm_change_state_handler() and
quite a few handlers registered via qemu_add_vm_change_state_handler().

I presume these VFIO prepare_cb()-related transitions need to finish
before calling any of these handlers to maintain the current ordering


IIUC, we don't necessarily need to keep this ordering; or do you see any
real demand of that?

What I see is that the order is only introduced due to P2P state of VFIO,
it seems very specific to me.


While I don't have a specific example why relaxing this ordering is unsafe
I am not certain that it is safe either - we can end up introducing subtle
race condition and possible guest memory corruption.

Maybe Alex and Cédric can provide some additional input here.

If nobody can authoritatively say that calling non-VFIO cb()
VM state change handlers before being certain that all VFIO devices
reached PRE_COPY_P2P / RUNNING_P2P state is indeed safe thing to do
I think we'll have to play safe here.

so some kind of interlocking in vm_state_notify() between calling
prepare_cb() and cb() may be necessary since any "prio" based
ordering will have the same problem of reverse calling order on VM
start vs stop.


To be clear, I'm not insisting on the implementation from this patch -
any alternative one that preserves the aforementioned ordering rules
and has similar downtime performance is probably okay too.


The problem is, IIUC patch 1 needs better justification on its own, and I
feel like it's not doing the right thing, as discussed previously. So even
if we want to introduce a new priority-based approach, we may want to
rethink patch 1.

IMHO it'll be perfect if we can simply reuse prepare_cb().  It seems
working.

Due to this demand, I got to learn we have pthread_barrier_init(), but..
personally I like a simple atomic+event approach:

vfio thread:

   # Do step 1, ->P2P
   if (qatomic_fetch_inc(&vfio_p2p_done) + 1 == vfio_total_devices) {
       qemu_event_set(&vfio_p2p_done_event);
   }
   qemu_event_wait(&vfio_p2p_done_event);
   # Do step 2, ->STOPCOPY

cb() only join()s.

Would this work?


At the first glance it should work - of course, we'd need to introduce
the vfio_total_devices counter too.

However the issue about synchronization with other (non-VFIO) cb()
handlers described above still (possibly) remains.

Thanks,


Thanks,
Maciej

Re: [PATCH 2/2] vfio/migration: Parallelize device state transitions

Reply via email to