On Wed, 22 Oct 2025 15:25:59 -0400
Peter Xu <[email protected]> wrote:

> This is v1, but it is not 10.2 material.  The earliest release I see fit
> would still be 11.0+, even if everything goes extremely smoothly.
> 
> I dropped the RFC tag only because I'm more confident that this can land
> without breaking things too easily, as I smoke-tested it a bit more across
> architectures this time.  AFAIU the best (and possibly only..) way to prove
> it solid is to merge it, likely in the early phase of a dev cycle.
> 
> The plan is to also try more device setups soon, before this lands.
> 
> Background
> ==========
> 
> Nowadays, live migration heavily depends on threads.  For example, most of
> the major features used in live migration today (multifd, postcopy,
> mapped-ram, vfio, etc.) work with threads internally.
> 
> But still, from time to time, we'll see some coroutines floating around the
> migration context.  The major one is precopy's loadvm, which is internally
> a coroutine.  It is still a critical path that any live migration depends on.
> 
> Mixing coroutines and threads is prone to issues.  For examples, see
> commit e65cec5e5d ("migration/ram: Yield periodically to the main loop")
> or commit 7afbdada7e ("migration/postcopy: ensure preempt channel is ready
> before loading states").
> 
> It has been a coroutine since this work (thanks to Fabiano, the
> archeologist, for digging up the link):
> 
>   https://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01136.html
> 
> [...]
>
> Tests
> =====
> 
> Default CI passes.
> 
> RDMA unit tests pass as usual. I also tried out cancellation / failure
> tests over RDMA channels, making sure nothing is stuck.
> 
> I also roughly measured how long it takes to run the whole 80+ migration
> qtest suite, and saw no measurable difference before / after this series.
> 
> I didn't test COLO.  I wanted to, but the doc example didn't work.
> 
> Risks
> =====
> 
> This series has the risk of breaking things.  I would be surprised if it
> didn't..
> 
> The current way of taking the BQL during FULL section load may cause
> issues: when IO is unstable, we could be waiting for IO (in the new
> migration incoming thread) with the BQL held.  This is unlikely, though, as
> it only happens when the network halts while flushing the device states.
> It is still possible, however.  One solution is to further break down the
> BQL critical sections into smaller ones, as mentioned in TODO.
> 
> Anything is more than welcome: suggestions, questions, objections, tests..
> 
> TODO
> ====
> 
> - Finer grained BQL breakdown
> 
> Peter Xu (13):
>   io: Add qio_channel_wait_cond() helper
>   migration: Properly wait on G_IO_IN when peeking messages
>   migration/rdma: Fix wrong context in qio_channel_rdma_shutdown()
>   migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread
>   migration/rdma: Change io_create_watch() to return immediately
>   migration: Introduce WITH_BQL_HELD() / WITH_BQL_RELEASED()
>   migration: Pass in bql_held information from qemu_loadvm_state()
>   migration: Thread-ify precopy vmstate load process
>   migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel
>   migration/postcopy: Remove workaround on wait preempt channel
>   migration/ram: Remove workaround on ram yield during load
>   migration: Allow blocking mode for incoming live migration
>   migration/vfio: Drop BQL dependency for loadvm SWITCHOVER_START
> 
>  include/io/channel.h        |  15 +++
>  include/migration/colo.h    |   6 +-
>  migration/migration.h       | 109 +++++++++++++++++--
>  migration/savevm.h          |   4 +-
>  hw/vfio/migration-multifd.c |   3 -
>  io/channel.c                |  21 ++--
>  migration/channel.c         |   7 +-
>  migration/colo-stubs.c      |   2 +-
>  migration/colo.c            |  26 ++---
>  migration/migration.c       |  81 ++++++++------
>  migration/qemu-file.c       |   6 +-
>  migration/ram.c             |  13 +--
>  migration/rdma.c            | 204 ++++++++----------------------------
>  migration/savevm.c          |  98 +++++++++--------
>  migration/trace-events      |   4 +-
>  15 files changed, 291 insertions(+), 308 deletions(-)
> 

Works well in my COLO testing. For the whole series:

Tested-by: Lukas Straub <[email protected]>
