colo: Deprecate COLO migration framework

Zhang Chen Fri, 16 Jan 2026 00:17:55 -0800

On Fri, Jan 16, 2026 at 8:37 AM Dr. David Alan Gilbert <[email protected]> wrote:
>
> * Peter Xu ([email protected]) wrote:
> > On Thu, Jan 15, 2026 at 10:59:47PM +0000, Dr. David Alan Gilbert wrote:
> > > * Peter Xu ([email protected]) wrote:
> > > > On Thu, Jan 15, 2026 at 10:49:29PM +0100, Lukas Straub wrote:
> > > > > Nack.
> > > > >
> > > > > This code has users, as explained in my other email:
> > > > > https://lore.kernel.org/qemu-devel/20260115224516.7f0309ba@penguin/T/#mc99839451d6841366619c4ec0d5af5264e2f6464
> > > >
> > > > Please then rework that series and consider including the following (I
> > > > believe I pointed this out a long time ago somewhere..):
> > > >
> > >
> > > > - Some form of justification of why multifd needs to be enabled for COLO.
> > > >   For example, in your cluster deployment, using multifd can improve XXX
> > > >   by YYY.  Please describe the use case and improvements.
> > >
> > > That one is pretty easy; since COLO is regularly taking snapshots, the
> > > faster the snapshotting, the less overhead there is.
> >
> > Thanks for chiming in, Dave.  I can explain why I want to ask for some
> > numbers.
> >
> > Firstly, numbers normally prove it's used in a real system.  It's at least
> > being used and seriously tested.
>
> Fair.
>
> > Secondly, per my very limited understanding of COLO... the two VMs in most
> > cases should be in-sync state already when both sides generate the same
> > network packets.
>
> (It's about a decade since I did any serious Colo, so I'll try and remember)


Haha, that was a pleasant time~
I already explained the background in the previous email.

>
> > Another sync (where multifd can start to take effect) is only needed when
> > there are packet misalignments, but IIUC it should be rare.  I don't know
> > how rare it is, it would be good if Lukas could introduce some of those
> > numbers in his deployment to help us understand COLO better if we'll need
> > to keep it.
>
> In reality misalignments are actually pretty common - although it's
> very workload dependent.  Any randomness in the order of execution in a
> multi-threaded guest for example, or when a timer arrives etc can change
> the packet generation.
> The migration time then becomes a latency issue before you can
> transmit the mismatched packet once it's detected.
>
> I think you still need to send a regular stream of snapshots even without
> having *yet* received a packet difference.  Now, I'm trying to remember the
> reasoning; for a start if you leave the difference too long the migration
> snapshot gets larger (which I think needs to be stored in RAM on the dest?)
> and also the chances of them getting a packet difference from
> randomness increase.
> I seem to remember there were clever schemes to get the optimal snapshot
> scheme.

Basically correct, as I explained in the previous email.
We cannot go without a migration (checkpoint) for an extended period of time.
Even if the applications' results are consistent, there is no guarantee that
two independently running guest kernels will behave completely identically.
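
To illustrate the point, here is a minimal Python sketch of the trigger
logic discussed above: checkpoint as soon as the packet streams differ, and
also checkpoint periodically even when they still match. This is not QEMU
code; the class name, method names and the max_interval_ms parameter are
hypothetical and only model the idea.

```python
from dataclasses import dataclass


@dataclass
class CheckpointPolicy:
    """Toy model of when a COLO-style checkpoint is due (hypothetical)."""

    max_interval_ms: int         # assumed upper bound between two checkpoints
    last_checkpoint_ms: int = 0  # time of the most recent checkpoint

    def should_checkpoint(self, now_ms: int, packets_match: bool) -> bool:
        # A packet mismatch means the two guests have diverged: sync now,
        # because the mismatched packet is held back until the sync is done.
        if not packets_match:
            return True
        # Even with matching packets, do not let the dirty state pile up.
        return now_ms - self.last_checkpoint_ms >= self.max_interval_ms

    def mark_checkpoint(self, now_ms: int) -> None:
        # Record that a checkpoint completed at now_ms.
        self.last_checkpoint_ms = now_ms


policy = CheckpointPolicy(max_interval_ms=1000)
print(policy.should_checkpoint(now_ms=500, packets_match=True))
print(policy.should_checkpoint(now_ms=500, packets_match=False))
```

In this model the cost of each checkpoint is paid in both branches, which is
why a faster snapshot transfer helps whether or not a mismatch was seen.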

>
> > IIUC, the critical path of COLO shouldn't be migration on its own?  It
> > should be when heartbeat gets lost; that normally should happen when two
> > VMs are in sync.  In this path, I don't see how multifd helps..  because
> > there's no migration happening, only the src recording what has changed.
> > Hence I think some number with description of the measurements may help us
> > understand how important multifd is to COLO.
>
> There's more than one critical path:
>   a) Time to recovery when one host fails
>   b) Overhead when both hosts are happy.
>
> > Supporting multifd will cause new COLO functions to inject into core
> > migration code paths (even if not much..). I want to make sure such (new)
> > complexity is justified. I also want to avoid introducing a feature only
> > because "we have XXX, then let's support XXX in COLO too, maybe some day
> > it'll be useful".
>
> I can't remember where the COLO code got into the main migration paths;
> is that the reception side storing the received differences somewhere else?
>

Yes. The COLO secondary has a buffer to store the primary VM state,
and loads it when a checkpoint is triggered.
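
As a rough illustration of that buffering, here is a minimal Python sketch.
It is not the QEMU implementation; the class and method names are
hypothetical, and the discard() path for an aborted checkpoint is an
assumption added for the example.

```python
class SecondaryStateBuffer:
    """Toy model: stage incoming primary state, apply it only on checkpoint."""

    def __init__(self) -> None:
        self._staged = bytearray()  # state received but not yet loaded
        self.applied_state = b""    # state the secondary is running from

    def receive(self, chunk: bytes) -> None:
        # Incoming data is only staged; the running secondary is untouched.
        self._staged.extend(chunk)

    def commit_checkpoint(self) -> None:
        # Checkpoint triggered: load everything staged in one step.
        self.applied_state = bytes(self._staged)
        self._staged.clear()

    def discard(self) -> None:
        # Assumption: an incomplete checkpoint is dropped, not applied.
        self._staged.clear()


buf = SecondaryStateBuffer()
buf.receive(b"ram-page-0;")
buf.receive(b"device-state;")
print(buf.applied_state)  # still empty: nothing is loaded before the checkpoint
buf.commit_checkpoint()
print(buf.applied_state)
```

The point of staging is that the secondary never runs from a half-received
state.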

Thanks
Chen

> > After these days, I found removing code is sometimes harder than writing
> > new..
>
> Haha yes.
>
> Dave
>
> > Thanks,
> >
> > >
> > > Lukas: Given COLO has a bunch of different features (i.e. the block
> > > replication, the clever network comparison etc) do you know which ones
> > > are used in the setups you are aware of?
> > >
> > > I'd guess the tricky part of a test would be the network side; I'm
> > > not too sure how you'd set that in a test.
> >
> > --
> > Peter Xu
> >
> --
>  -----Open up your eyes, open up your mind, open up your code -------
> / Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \
> \        dave @ treblig.org |                               | In Hex /
>  \ _________________________|_____ http://www.treblig.org   |_______/

Re: [PATCH 1/3] migration/colo: Deprecate COLO migration framework
