* Peter Xu ([email protected]) wrote:
> On Thu, Jan 15, 2026 at 10:59:47PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu ([email protected]) wrote:
> > > On Thu, Jan 15, 2026 at 10:49:29PM +0100, Lukas Straub wrote:
> > > > Nack.
> > > > 
> > > > This code has users, as explained in my other email:
> > > > https://lore.kernel.org/qemu-devel/20260115224516.7f0309ba@penguin/T/#mc99839451d6841366619c4ec0d5af5264e2f6464
> > > 
> > > Please then rework that series and consider including the following (I
> > > believe I pointed out a long time ago somewhere..):
> > > 
> > 
> > > - Some form of justification of why multifd needs to be enabled for COLO.
> > >   For example, in your cluster deployment, using multifd can improve XXX
> > >   by YYY.  Please describe the use case and improvements.
> > 
> > That one is pretty easy; since COLO is regularly taking snapshots, the
> > faster the snapshotting the less overhead there is.
> 
> Thanks for chiming in, Dave.  I can explain why I want to request for some
> numbers.
> 
> Firstly, numbers normally prove that it's used in a real system.  It's at
> least being used and seriously tested.

Fair.

> Secondly, per my very limited understanding of COLO... the two VMs in most
> cases should already be in sync when both sides generate the same
> network packets.

(It's about a decade since I did any serious COLO, so I'll try and remember)

> Another sync (where multifd can start to take effect) is only needed when
> there're packets misalignments, but IIUC it should be rare.  I don't know
> how rare it is, it would be good if Lukas could introduce some of those
> numbers in his deployment to help us understand COLO better if we'll need
> to keep it.

In reality misalignments are actually pretty common - although it's very
workload dependent.  Any randomness in the order of execution in a
multi-threaded guest, for example, or the arrival of a timer, can change
the packet generation.  The migration time then becomes a latency issue:
once a mismatch is detected, the packet can't be transmitted until the
snapshot completes.
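(To illustrate - this is just a toy model with made-up names, not the
actual COLO code: on a mismatch the primary's packet is held back until a
checkpoint finishes, so checkpoint speed translates directly into packet
latency.)

```python
import time


def handle_outgoing_packet(primary_pkt, secondary_pkt, trigger_checkpoint):
    """Toy model of COLO-style output commit: the primary's packet is
    compared against the secondary's; on a mismatch a checkpoint is
    forced, and the time that takes is pure added packet latency.
    (Illustrative only - names and structure are made up.)"""
    if primary_pkt == secondary_pkt:
        return primary_pkt, 0.0          # in sync: release immediately
    start = time.monotonic()
    trigger_checkpoint()                 # resync the secondary (the migration)
    latency = time.monotonic() - start   # checkpoint duration = extra latency
    return primary_pkt, latency
```

So anything (like multifd) that shortens the checkpoint shortens the
worst-case packet delay on a mismatch.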

I think you still need to send a regular stream of snapshots even without
having *yet* received a packet difference.  Now, I'm trying to remember the
reasoning: for a start, if you leave it too long between snapshots the
migration snapshot gets larger (which I think needs to be stored in RAM on
the destination?), and the chance of the two getting a packet difference
from randomness also increases.  I seem to remember there were clever
schemes to pick the optimal snapshot interval.
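Something like this toy model (my own illustration with made-up rates,
nothing from the COLO code) captures the tradeoff: stretch the interval
and both the pending snapshot size and the chance of a forced early
checkpoint grow.

```python
def snapshot_tradeoff(interval_s, dirty_rate_pages_per_s, p_mismatch_per_s):
    """Toy model of the checkpoint-interval tradeoff: a longer interval
    means more dirty pages to transfer at the next snapshot, and a
    higher chance that some packet mismatch forces one early.
    (Illustrative only - inputs are hypothetical, not measurements.)"""
    snapshot_pages = dirty_rate_pages_per_s * interval_s
    # probability of at least one mismatch somewhere in the interval
    p_forced = 1 - (1 - p_mismatch_per_s) ** interval_s
    return snapshot_pages, p_forced
```

Picking the interval is then a balance between steady-state overhead
(frequent small snapshots) and the cost of the occasional big one.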

> IIUC, the critical path of COLO shouldn't be migration on its own?  It
> should be when heartbeat gets lost; that normally should happen when two
> VMs are in sync.  In this path, I don't see how multifd helps..  because
> there's no migration happening, only the src recording what has changed.
> Hence I think some number with description of the measurements may help us
> understand how important multifd is to COLO.

There's more than one critical path:
  a) Time to recovery when one host fails
  b) Overhead when both hosts are happy.

> Supporting multifd will cause new COLO functions to inject into core
> migration code paths (even if not much..). I want to make sure such (new)
> complexity is justified. I also want to avoid introducing a feature only
> because "we have XXX, then let's support XXX in COLO too, maybe some day
> it'll be useful".

I can't remember where the COLO code got into the main migration paths;
is that the reception side storing the received differences somewhere else?

> After these days, I found removing code is sometimes harder than writing
> new..

Haha yes.

Dave

> Thanks,
> 
> > 
> > Lukas: Given COLO has a bunch of different features (i.e. the block
> > replication, the clever network comparison etc) do you know which ones
> > are used in the setups you are aware of?
> > 
> > I'd guess the tricky part of a test would be the network side; I'm
> > not too sure how you'd set that in a test.
> 
> -- 
> Peter Xu
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/
