On Tue, Jun 02, 2026 at 12:01:32PM +0200, Maciej S. Szmigiero wrote:
> On 1.06.2026 21:33, Peter Xu wrote:
> > On Mon, Jun 01, 2026 at 10:39:26AM -0300, Fabiano Rosas wrote:
> > > A bit idiosyncratic, but I don't have any better suggestions.
> > 
> > One thing I don't think proper with the current approach: if it's a real
> > "priority" it should be invoked always with high->low order, no matter if
> > running=true or not..  but that's not what will happen if we encode it with
> > the "depth" field.
> > 
> > That is an implication that we have two demands, and they aren't the same,
> > but this solution wrongly packed them together.
> > 
> > In 2023, VFIO already introduced VMChangeStateEntry.prepare_cb().  That is
> > a real impl of prority queues with only two: prepare_cb() always has higher
> > priority than the cb() itself.
> > 
> > Can VFIO simply leverage this existing interface, instead of hijacking
> > "depth" in an unwanted way?
> > 
> > I believe the _P2P states still need to happen first for all vfio devices,
> > then it can do the next step. Logically I think it can also be done
> > internally within VFIO by proper impl of these two callbacks, to achieve
> > both:
> > 
> > - Proper transition to P2P states first for all devices, meanwhile,
> > - Proper concurrency of all VFIO devices on device state transitions
> > 
> > Would that work in a cleaner way?
> 
> In the VFIO migration save/source code we have two transitions:
> PRE_COPY -> PRE_COPY_P2P done via prepare_cb() and
> PRE_COPY_P2P -> STOP_COPY done via cb().
> On the load/target side the situation is similar, only states are
> different.
> 
> For each of these transitions we need two callbacks: one has to launch
> per-VFIO device state change thread and the other one has to "collect"
> (or join) these threads.
> 
> All handlers of the first callback have to be complete before calling
> any handler of the second callback since otherwise there would be
> VFIO device vs VFIO device serialization.
> 
> Moreover, all the prepare_cb()-related transitions have to be done
> before any of the cb()-related is started.
> 
> 
> I presume what you suggest above is to launch per-VFIO device state
> transition thread in prepare_cb() which would then be responsible for
> doing *both* state transitions for this VFIO device.
> And then collect this thread in cb().

Yes.

> 
> However, this would need some extra interlocking:
> - A lock inside VFIO code shared by all VFIO devices making sure
> no VFIO device starts the second transition (the cb() one) before
> all VFIO devices finish the first transition (the prepare_cb() one).

Yes, more below.

> 
> - Looking at the current QEMU code, there are some non-VFIO cb()
> handlers registered via qdev_add_vm_change_state_handler() and
> quite a few handlers registered via qemu_add_vm_change_state_handler().
> 
> I presume these VFIO prepare_cb()-related transitions need to finish
> before calling any of these handlers to maintain the current ordering

IIUC, we don't necessarily need to keep this ordering; or do you see any
real demand of that?

What I see is that the order is only introduced due to P2P state of VFIO,
it seems very specific to me.

> so some kind of interlocking in vm_state_notify() between calling
> prepare_cb() and cb() may be necessary since any "prio" based
> ordering will have the same problem of reverse calling order on VM
> start vs stop.
> 
> 
> To be clear, I'm not insisting on the implementation from this patch -
> any alternative one that preserves the aforementioned ordering rules
> and has similar downtime performance is probably okay too.

The problem is, IIUC patch 1 needs better justification on its own, and I
feel like it's not doing the right thing, as discussed previously. So even
if we want to introduce a new priority-based approach, we may want to
rethink patch 1.

IMHO it'll be perfect if we can simply reuse prepare_cb().  It seems
working.

Due to this demand, I got to learn we have pthread_barrier_init(), but..
personally I like a simple atomic+event approach:

vfio thread:

  # Do step 1, ->P2P
  if (qatomic_fetch_inc(&vfio_p2p_done) + 1 == vfio_total_devices) {            
                                                      
      qemu_event_set(&vfio_p2p_done_event);
  }                                                                  
  qemu_event_wait(&vfio_p2p_done_event);
  # Do step 2, ->STOPCOPY

cb() only join()s.

Would this work?

Thanks,

-- 
Peter Xu


Reply via email to