On 1.06.2026 21:33, Peter Xu wrote:
On Mon, Jun 01, 2026 at 10:39:26AM -0300, Fabiano Rosas wrote:
A bit idiosyncratic, but I don't have any better suggestions.
One thing I don't think proper with the current approach: if it's a real
"priority" it should be invoked always with high->low order, no matter if
running=true or not.. but that's not what will happen if we encode it with
the "depth" field.
That is an implication that we have two demands, and they aren't the same,
but this solution wrongly packed them together.
In 2023, VFIO already introduced VMChangeStateEntry.prepare_cb(). That is
a real impl of prority queues with only two: prepare_cb() always has higher
priority than the cb() itself.
Can VFIO simply leverage this existing interface, instead of hijacking
"depth" in an unwanted way?
I believe the _P2P states still need to happen first for all vfio devices,
then it can do the next step. Logically I think it can also be done
internally within VFIO by proper impl of these two callbacks, to achieve
both:
- Proper transition to P2P states first for all devices, meanwhile,
- Proper concurrency of all VFIO devices on device state transitions
Would that work in a cleaner way?
In the VFIO migration save/source code we have two transitions:
PRE_COPY -> PRE_COPY_P2P done via prepare_cb() and
PRE_COPY_P2P -> STOP_COPY done via cb().
On the load/target side the situation is similar, only states are
different.
For each of these transitions we need two callbacks: one has to launch
per-VFIO device state change thread and the other one has to "collect"
(or join) these threads.
All handlers of the first callback have to be complete before calling
any handler of the second callback since otherwise there would be
VFIO device vs VFIO device serialization.
Moreover, all the prepare_cb()-related transitions have to be done
before any of the cb()-related is started.
I presume what you suggest above is to launch per-VFIO device state
transition thread in prepare_cb() which would then be responsible for
doing *both* state transitions for this VFIO device.
And then collect this thread in cb().
However, this would need some extra interlocking:
- A lock inside VFIO code shared by all VFIO devices making sure
no VFIO device starts the second transition (the cb() one) before
all VFIO devices finish the first transition (the prepare_cb() one).
- Looking at the current QEMU code, there are some non-VFIO cb()
handlers registered via qdev_add_vm_change_state_handler() and
quite a few handlers registered via qemu_add_vm_change_state_handler().
I presume these VFIO prepare_cb()-related transitions need to finish
before calling any of these handlers to maintain the current ordering
so some kind of interlocking in vm_state_notify() between calling
prepare_cb() and cb() may be necessary since any "prio" based
ordering will have the same problem of reverse calling order on VM
start vs stop.
To be clear, I'm not insisting on the implementation from this patch -
any alternative one that preserves the aforementioned ordering rules
and has similar downtime performance is probably okay too.
Thanks,
Thanks,
Maciej