On Thu, Nov 6, 2025 at 9:10 AM Zhijian Li (Fujitsu)
<[email protected]> wrote:
>
>
>
> On 06/11/2025 04:58, Peter Xu wrote:
> > On Tue, Nov 04, 2025 at 09:36:06AM +0800, Li Zhijian wrote:
> >> Commit 4881411136 ("migration: Always set DEVICE state") set a new DEVICE
> >> state before completed during migration, which broke the original
> >> transition to COLO. The migration flow for precopy has changed to:
> >> active -> pre-switchover -> device -> completed.
> >>
> >> This patch updates the transition state to ensure that the Pre-COLO
> >> state corresponds to DEVICE state correctly.
> >>
> >> Fixes: 4881411136 ("migration: Always set DEVICE state")
> >> Signed-off-by: Li Zhijian <[email protected]>
> >> ---
> >>   migration/migration.c | 4 ++--
> >>   1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/migration/migration.c b/migration/migration.c
> >> index a63b46bbef..6ec7f3cec8 100644
> >> --- a/migration/migration.c
> >> +++ b/migration/migration.c
> >> @@ -3095,9 +3095,9 @@ static void migration_completion(MigrationState *s)
> >>           goto fail;
> >>       }
> >>
> >> -    if (migrate_colo() && s->state == MIGRATION_STATUS_ACTIVE) {
> >> +    if (migrate_colo() && s->state == MIGRATION_STATUS_DEVICE) {
> >>           /* COLO does not support postcopy */
> >> -        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
> >> +        migrate_set_state(&s->state, MIGRATION_STATUS_DEVICE,
> >>                             MIGRATION_STATUS_COLO);
> >>       } else {
> >>           migration_completion_end(s);
> >
> > Thanks a lot for fixing it, Zhijian.  It means I broke COLO already for
> > 10.0/10.1..
> >
> > Hailiang/Chen, do you still know anyone who is using COLO, especially in
> > enterprise?  I don't expect any individual using it.. It definitely
> > complicates migration logics all over the places.  Fabiano and I discussed
> > a few times on removing legacy code and COLO was always in the list.
> >
> > We used to discuss RDMA obsoletion too, that's when Huawei developers at
> > least tried to re-implement the whole RDMA using rsocket, that didn't land
> > only because of a perf regression.  Meanwhile, Zhijian also provided an
> > unit test, which we rely on recently to not break RDMA at the minimum.
> >
> > If we do not have known users, I sincerely want to discuss with you on
> > obsoletion and removal of COLO from qemu codebase.  Do you see feasible?
> >
> > Zhijian, do you have any input here?
>
>
> If we don't have any known users, I personally have no objection to
> removing COLO.
>
> From my previous understanding, its use cases are rather limited, and the
> checkpointing overhead is significant.
> Moreover, with the continuous development of Cloud Native over the past
> decade, service-based FT/HA solutions have become very mature, which
> shrinks the use cases for VM-based FT solutions even further.
>
> I think it's worth keeping if we have:
>
> - Active users who depend on it.
> - A unit test for the COLO framework.
>
> Thanks
> Zhijian
>
>

Adding Lukas to CC.

From a technical point of view, I agree with Zhijian's comments. We can
probably do this gradually.
On my side, I know some local companies that build their HA/FT products on COLO.
In that case, I think most of them have already forked the upstream QEMU code
into a private repo that they maintain internally.
This may cause some upgrade issues in the future.

Another part is the integration of COLO into the pacemaker project, which
Lukas worked on; I don't know the status of pacemaker users.
Maybe Lukas can comment?

As for the implementation, COLO is not only the migration code (which is
its core); it also includes network and block replication parts that work
together with it.
If we remove the migration-related code, we need to consider how to handle
the other parts: could the network part become a general QEMU netfilter?
And what about block replication?

For the COLO framework unit test, I think we would need to add some
"#if defined(qtest)" blocks in the migration code for testing (the COLO
proxy/netfilter already has an independent qtest).
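
As a rough illustration of what such a test would have to assert, here is a
minimal standalone sketch of the state flow from the commit message
(active -> pre-switchover -> device, then colo or completed). It is not
QEMU code: ModelState, model_set_state() and model_precopy_end() are
hypothetical names made up for this example, and model_set_state() only
assumes the "change the state if it still equals the old one" behaviour
that the patch relies on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the MIGRATION_STATUS_* values; not QEMU code. */
typedef enum {
    MODEL_ACTIVE,
    MODEL_PRE_SWITCHOVER,
    MODEL_DEVICE,
    MODEL_COMPLETED,
    MODEL_COLO,
} ModelState;

/* Change *state only if it currently equals old_state; report the outcome. */
static bool model_set_state(ModelState *state, ModelState old_state,
                            ModelState new_state)
{
    if (*state != old_state) {
        return false;
    }
    *state = new_state;
    return true;
}

/*
 * Walk the precopy flow up to the completion step, then apply the COLO
 * check against expected_state (ACTIVE before the fix, DEVICE after it).
 */
static ModelState model_precopy_end(bool colo_enabled,
                                    ModelState expected_state)
{
    ModelState s = MODEL_ACTIVE;

    model_set_state(&s, MODEL_ACTIVE, MODEL_PRE_SWITCHOVER);
    model_set_state(&s, MODEL_PRE_SWITCHOVER, MODEL_DEVICE);

    if (colo_enabled && s == expected_state) {
        model_set_state(&s, expected_state, MODEL_COLO);
    } else {
        model_set_state(&s, MODEL_DEVICE, MODEL_COMPLETED);
    }
    return s;
}

/* The three cases a COLO state-transition test would want to cover. */
static void model_selftest(void)
{
    /* Old check: the state is already DEVICE, so COLO is never entered. */
    assert(model_precopy_end(true, MODEL_ACTIVE) == MODEL_COMPLETED);
    /* Fixed check: DEVICE -> COLO. */
    assert(model_precopy_end(true, MODEL_DEVICE) == MODEL_COLO);
    /* COLO disabled: DEVICE -> COMPLETED. */
    assert(model_precopy_end(false, MODEL_DEVICE) == MODEL_COMPLETED);
}
```

Calling model_selftest() from a small main() is enough to run it. A real
qtest would of course have to drive an actual migration instead of this
model.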

Thanks
Chen





