On Fri, Jan 16, 2026 at 7:39 AM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 15, 2026 at 10:59:47PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu ([email protected]) wrote:
> > > On Thu, Jan 15, 2026 at 10:49:29PM +0100, Lukas Straub wrote:
> > > > Nack.
> > > >
> > > > This code has users, as explained in my other email:
> > > > https://lore.kernel.org/qemu-devel/20260115224516.7f0309ba@penguin/T/#mc99839451d6841366619c4ec0d5af5264e2f6464
> > >
> > > Please then rework that series and consider including the following (I
> > > believe I pointed out a long time ago somewhere..):
> > >
> > > - Some form of justification of why multifd needs to be enabled for COLO.
> > >   For example, in your cluster deployment, using multifd can improve XXX
> > >   by YYY. Please describe the use case and improvements.
> >
> > That one is pretty easy; since COLO is regularly taking snapshots, the
> > faster the snapshotting, the less overhead there is.
>
> Thanks for chiming in, Dave. I can explain why I want to request some
> numbers.
>
> Firstly, numbers normally prove it's used in a real system. It's at least
> being used and seriously tested.
>
Agree.

> Secondly, per my very limited understanding of COLO... the two VMs in most
> cases should already be in an in-sync state when both sides generate the
> same network packets.

In most cases you are right, but an FT/HA system is designed exactly for
the rare cases.

> Another sync (where multifd can start to take effect) is only needed when
> there are packet misalignments, but IIUC it should be rare. I don't know
> how rare it is; it would be good if Lukas could share some of those
> numbers from his deployment to help us understand COLO better if we'll
> need to keep it.

I haven't tested the multifd part yet, but let me introduce the background.
COLO triggers a checkpoint (a round of live migration) in two ways: when
the network comparison detects diverging packets, and periodically (e.g.
every 10s). That means COLO VM performance depends on the VM stop time of
each migration round; maybe multifd can help with that, Lukas?

> IIUC, the critical path of COLO shouldn't be migration on its own? It
> should be when heartbeat gets lost; that normally should happen when two
> VMs are in sync. In this path, I don't see how multifd helps.. because
> there's no migration happening, only the src recording what has changed.
> Hence I think some numbers with a description of the measurements may
> help us understand how important multifd is to COLO.

Yes, after failover the secondary VM keeps running without migration.

> Supporting multifd will cause new COLO functions to be injected into core
> migration code paths (even if not much..). I want to make sure such (new)
> complexity is justified. I also want to avoid introducing a feature only
> because "we have XXX, then let's support XXX in COLO too, maybe some day
> it'll be useful".
>
> After these days, I found removing code is sometimes harder than writing
> new..

Agree. As Lukas said, some customers do not follow upstream code (or run a
version two releases behind) for COLO, because FT/HA users focus on system
availability and an upgrade is a high risk for them. I think the main
reason COLO was broken in the QEMU 10.0/10.1 releases is the lack of a test
case (Lukas has WIP on this).

Thanks
Chen

> Thanks,
>
> >
> > Lukas: Given COLO has a bunch of different features (i.e. the block
> > replication, the clever network comparison etc) do you know which ones
> > are used in the setups you are aware of?
> >
> > I'd guess the tricky part of a test would be the network side; I'm
> > not too sure how you'd set that in a test.
> --
> Peter Xu
>
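P.S. In case it helps to collect the numbers Peter is asking for, here is a
minimal HMP sketch for the primary side, assuming Lukas's COLO+multifd
series is applied on top (x-checkpoint-delay is in milliseconds, so 10000
matches the 10s period above; the delay, channel count and address are only
illustrative placeholders, and as usual the x-colo capability also has to be
set on the secondary side):

    (qemu) migrate_set_capability x-colo on
    (qemu) migrate_set_parameter x-checkpoint-delay 10000
    (qemu) migrate_set_capability multifd on
    (qemu) migrate_set_parameter multifd-channels 4
    (qemu) migrate -d tcp:<secondary-ip>:<port>

Comparing the per-checkpoint VM stop time with and without the two multifd
lines should give the kind of measurement requested above.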
