On Tue, Jun 02, 2026 at 12:01:32PM +0200, Maciej S. Szmigiero wrote:
> On 1.06.2026 21:33, Peter Xu wrote:
> > On Mon, Jun 01, 2026 at 10:39:26AM -0300, Fabiano Rosas wrote:
> > > A bit idiosyncratic, but I don't have any better suggestions.
> >
> > One thing I don't think proper with the current approach: if it's a real
> > "priority" it should be invoked always with high->low order, no matter if
> > running=true or not.. but that's not what will happen if we encode it with
> > the "depth" field.
> >
> > That is an implication that we have two demands, and they aren't the same,
> > but this solution wrongly packed them together.
> >
> > In 2023, VFIO already introduced VMChangeStateEntry.prepare_cb(). That is
> > a real impl of prority queues with only two: prepare_cb() always has higher
> > priority than the cb() itself.
> >
> > Can VFIO simply leverage this existing interface, instead of hijacking
> > "depth" in an unwanted way?
> >
> > I believe the _P2P states still need to happen first for all vfio devices,
> > then it can do the next step. Logically I think it can also be done
> > internally within VFIO by proper impl of these two callbacks, to achieve
> > both:
> >
> > - Proper transition to P2P states first for all devices, meanwhile,
> > - Proper concurrency of all VFIO devices on device state transitions
> >
> > Would that work in a cleaner way?
>
> In the VFIO migration save/source code we have two transitions:
> PRE_COPY -> PRE_COPY_P2P done via prepare_cb() and
> PRE_COPY_P2P -> STOP_COPY done via cb().
> On the load/target side the situation is similar, only states are
> different.
>
> For each of these transitions we need two callbacks: one has to launch
> per-VFIO device state change thread and the other one has to "collect"
> (or join) these threads.
>
> All handlers of the first callback have to be complete before calling
> any handler of the second callback since otherwise there would be
> VFIO device vs VFIO device serialization.
>
> Moreover, all the prepare_cb()-related transitions have to be done
> before any of the cb()-related is started.
>
>
> I presume what you suggest above is to launch per-VFIO device state
> transition thread in prepare_cb() which would then be responsible for
> doing *both* state transitions for this VFIO device.
> And then collect this thread in cb().
Yes.
>
> However, this would need some extra interlocking:
> - A lock inside VFIO code shared by all VFIO devices making sure
> no VFIO device starts the second transition (the cb() one) before
> all VFIO devices finish the first transition (the prepare_cb() one).
Yes, more below.
>
> - Looking at the current QEMU code, there are some non-VFIO cb()
> handlers registered via qdev_add_vm_change_state_handler() and
> quite a few handlers registered via qemu_add_vm_change_state_handler().
>
> I presume these VFIO prepare_cb()-related transitions need to finish
> before calling any of these handlers to maintain the current ordering
IIUC, we don't necessarily need to keep this ordering; or do you see any
real demand of that?
What I see is that the order is only introduced due to P2P state of VFIO,
it seems very specific to me.
> so some kind of interlocking in vm_state_notify() between calling
> prepare_cb() and cb() may be necessary since any "prio" based
> ordering will have the same problem of reverse calling order on VM
> start vs stop.
>
>
> To be clear, I'm not insisting on the implementation from this patch -
> any alternative one that preserves the aforementioned ordering rules
> and has similar downtime performance is probably okay too.
The problem is, IIUC patch 1 needs better justification on its own, and I
feel like it's not doing the right thing, as discussed previously. So even
if we want to introduce a new priority-based approach, we may want to
rethink patch 1.
IMHO it'll be perfect if we can simply reuse prepare_cb(). It seems
working.
Due to this demand, I got to learn we have pthread_barrier_init(), but..
personally I like a simple atomic+event approach:
vfio thread:
# Do step 1, ->P2P
if (qatomic_fetch_inc(&vfio_p2p_done) + 1 == vfio_total_devices) {
qemu_event_set(&vfio_p2p_done_event);
}
qemu_event_wait(&vfio_p2p_done_event);
# Do step 2, ->STOPCOPY
cb() only join()s.
Would this work?
Thanks,
--
Peter Xu