On Tue, Jan 06, 2026 at 05:15:23PM +0530, Prasad Pandit wrote:
> Hi,
> 
> On Tue, 23 Dec 2025 at 21:00, Peter Xu <[email protected]> wrote:
> > One thing good about exposing such status via QAPI is, it can help us
> > diagnose issues by seeing CANCELLING / FAILING even looking at
> > query-migrate results (as normally when bug happens we can't see the
> > internal status..), so that we know either it's explicitly cancelled, or
> > something went wrong.
> >
> > If it's a completely hidden / internal status, we may see ACTIVE even if
> > something wrong happened..
> 
> * Both process state and reason(s) for the state change needs to be
> visible to the user. But states like cancelling/failing are redundant,
> users would derive the same conclusion from CANCELLED and CANCELLING
> OR FAILED AND FAILING. Besides, migration_cleanup() does exactly the
> same steps irrespective of whether migration is failing or cancelling
> or failed or cancelled.
> 
> > My current hope is any mgmt should normally by default ignore new migration
> > states..  If that's always achieved, it looks to me adding FAILING directly
> > into migration status would still have some benefits on debugging.
> 
> * libvirtd(8) complains about unknown states multiple times:
>       libvirtd[2194267]: unknown status 'failing' in migration event
>       libvirtd[2194267]: unknown status 'failing' in migration event
>       libvirtd[2194267]: unknown status 'failing' in migration event

IIUC these are benign warnings, so should be fine. We'll need one entry for
libvirt ultimately to avoid this warning.  Copying Dan and Jiri in case I
am wrong.

> 
> 
> > > > @@ -2907,7 +2914,7 @@ fail_closefb:
> > > >      qemu_fclose(fb);
> > > >  fail:
> > > >      if (ms->state != MIGRATION_STATUS_CANCELLING) {
> > > > -        migrate_set_state(&ms->state, ms->state, 
> > > > MIGRATION_STATUS_FAILED);
> > > > +        migrate_set_state(&ms->state, ms->state, 
> > > > MIGRATION_STATUS_FAILING);
> > > >      }
> > >
> > > This is a good example where having MigrationStatus makes the code more
> > > complicated. This could be exiting=true, running=false, etc. It
> > > shouldn't matter why this routine failed. If we reach
> > > migration_cleanup() and, at the very end, state is CANCELLING, then we
> > > know the cancel command has caused this and set the state to
> > > CANCELLED. If the state was something else, then it's unintended and we
> > > set FAILED.
> >
> > If it'll be an internal status, we'll still need to identify if someone
> > already have cancelled it, right?
> >
> > Assuming we introduce stop_reason flag, making it:
> >
> > enum {
> >     MIG_STOP_REASON_CANCEL,
> >     MIG_STOP_REASON_FAIL,
> > } MigrationStopReason;
> >
> > Then we can switch to CANCELLED / FAILED when cleanup from those reasons.
> >
> > Then here, logically we also need logic like:
> >
> >     if (stop_reason != MIG_STOP_REASON_CANCEL) {
> >         stop_reason = MIG_STOP_REASON_FAIL;
> >     }
> >
> > Because we want to make sure when the user already triggered cancel, it
> > won't show FAILED but only show CANCELLED at last?
> 
> * I think the way we are setting/changing these states in as many
> locations is only adding to the complications. Do we have to
> explicitly set these states like this? What if migration_cleanup()
> always sets the state to 'STOP'. Similarly other places set the state
> to a predefined state. OR
> ===
>     struct {
>         current_state;
>         old_state;
>         event/trigger;
>         reason[];
>     } MigrationState s;
> 
>     migration_change_state(s) {
>           s->old_state = s->current_state;
>           if (s->current_state == START && s->trigger ==
> 'connections-established') {
>               s->current_state = ACTIVE;
>               s->reason = "connections-established, migration starting"
>           } else if (s->current_state == ACTIVE && s->trigger == 'completed') 
> {
>               s->current_state = STOP
>               s->reason = "migration completed"
>           } else if (s->current_state == ACTIVE  && s->trigger == 'pause') {
>               s->current_state = PAUSE
>               s->reason = "pause, migration paused"
>           } else if (s->current_state == ACTIVE && s->trigger ==
> 'error-occurred') {
>               s->current_state = STOP
>               s->reason = "Error occurred, migration failed"

We can't change status that were already used, like FAILED.  Libvirt and
all mgmt may rely on it.

>           } else if (s->current_state == ACTIVE && s->trigger ==
> 'user-cancel') {
>               s->current_state = STOP
>               s->reason = "user-cancel, migration cancelled"
>          } else {
>               s->current_state = s->current_state;
>               warn_msg("unknown combination, maybe define a new rule?");
>          }
>     }
> ===
> * We define explicit rules for the state change and accordingly we
> only call migration_change_state() at any point and it'll change to an
> appropriate next state, recording the due reason for the change.
> 
> Wdyt...?

Personally I don't see much benefit on adding a new "trigger" internal API.
If we want to forbid some state machine transitions, we can use a
transition map.  Said that, IMHO it's separate from what we're discussing
here.

Thanks,

> 
> Thank you.
> ---
>   - Prasad
> 

-- 
Peter Xu


Reply via email to