Prasad Pandit <[email protected]> writes:

> Hi,
>
> On Tue, 23 Dec 2025 at 21:00, Peter Xu <[email protected]> wrote:
>> One thing good about exposing such status via QAPI is, it can help us
>> diagnose issues by seeing CANCELLING / FAILING even looking at
>> query-migrate results (as normally when bug happens we can't see the
>> internal status..), so that we know either it's explicitly cancelled, or
>> something went wrong.
>>
>> If it's a completely hidden / internal status, we may see ACTIVE even if
>> something wrong happened..
>
> * Both process state and reason(s) for the state change needs to be
> visible to the user. But states like cancelling/failing are redundant,
> users would derive the same conclusion from CANCELLED and CANCELLING
> OR FAILED AND FAILING. Besides, migration_cleanup() does exactly the
> same steps irrespective of whether migration is failing or cancelling
> or failed or cancelled.
>
>> My current hope is any mgmt should normally by default ignore new migration
>> states..  If that's always achieved, it looks to me adding FAILING directly
>> into migration status would still have some benefits on debugging.
>
> * libvirtd(8) complains about unknown states multiple times:
>       libvirtd[2194267]: unknown status 'failing' in migration event
>       libvirtd[2194267]: unknown status 'failing' in migration event
>       libvirtd[2194267]: unknown status 'failing' in migration event
>
>
>> > > @@ -2907,7 +2914,7 @@ fail_closefb:
>> > >      qemu_fclose(fb);
>> > >  fail:
>> > >      if (ms->state != MIGRATION_STATUS_CANCELLING) {
>> > > -        migrate_set_state(&ms->state, ms->state, 
>> > > MIGRATION_STATUS_FAILED);
>> > > +        migrate_set_state(&ms->state, ms->state, 
>> > > MIGRATION_STATUS_FAILING);
>> > >      }
>> >
>> > This is a good example where having MigrationStatus makes the code more
>> > complicated. This could be exiting=true, running=false, etc. It
>> > shouldn't matter why this routine failed. If we reach
>> > migration_cleanup() and, at the very end, state is CANCELLING, then we
>> > know the cancel command has caused this and set the state to
>> > CANCELLED. If the state was something else, then it's unintended and we
>> > set FAILED.
>>
>> If it'll be an internal status, we'll still need to identify if someone
>> already have cancelled it, right?
>>
>> Assuming we introduce stop_reason flag, making it:
>>
>> enum {
>>     MIG_STOP_REASON_CANCEL,
>>     MIG_STOP_REASON_FAIL,
>> } MigrationStopReason;
>>
>> Then we can switch to CANCELLED / FAILED when cleanup from those reasons.
>>
>> Then here, logically we also need logic like:
>>
>>     if (stop_reason != MIG_STOP_REASON_CANCEL) {
>>         stop_reason = MIG_STOP_REASON_FAIL;
>>     }
>>
>> Because we want to make sure when the user already triggered cancel, it
>> won't show FAILED but only show CANCELLED at last?
>
> * I think the way we are setting/changing these states in as many
> locations is only adding to the complications. Do we have to
> explicitly set these states like this? What if migration_cleanup()
> always sets the state to 'STOP'. Similarly other places set the state
> to a predefined state. OR
> ===
>     struct {
>         current_state;
>         old_state;
>         event/trigger;
>         reason[];
>     } MigrationState s;
>
>     migration_change_state(s) {
>           s->old_state = s->current_state;
>           if (s->current_state == START && s->trigger ==
> 'connections-established') {
>               s->current_state = ACTIVE;
>               s->reason = "connections-established, migration starting"
>           } else if (s->current_state == ACTIVE && s->trigger == 'completed') 
> {
>               s->current_state = STOP
>               s->reason = "migration completed"
>           } else if (s->current_state == ACTIVE  && s->trigger == 'pause') {
>               s->current_state = PAUSE
>               s->reason = "pause, migration paused"
>           } else if (s->current_state == ACTIVE && s->trigger ==
> 'error-occurred') {
>               s->current_state = STOP
>               s->reason = "Error occurred, migration failed"
>           } else if (s->current_state == ACTIVE && s->trigger ==
> 'user-cancel') {
>               s->current_state = STOP
>               s->reason = "user-cancel, migration cancelled"
>          } else {
>               s->current_state = s->current_state;
>               warn_msg("unknown combination, maybe define a new rule?");
>          }
>     }
> ===
> * We define explicit rules for the state change and accordingly we
> only call migration_change_state() at any point and it'll change to an
> appropriate next state, recording the due reason for the change.
>

If we had a linear state transition table, i.e. a DFA without any
branching, that would be ideal. But since we have states that can reach
(and be reached from) multiple other states, then we'll always need some
input to migration_change_state(). Here you're making it the
s->trigger. Where will that come from?

Looking at runstate.c and job.c, it seems we could at least define a
state transition table and do away with the second parameter to
migrate_set_state(s, old, new).

As we've been discussing, the current state-change mechanism has the
dual purpose of emitting the state change event and also serving as
internal tracking of the migration state. It's not clear to me whether
you're covering both in this proposal or just one of them.

I don't think we've established actually what are the goals of having
any state changes. Do we even need state changes for internal tracking?
We could use your s->trigger as an enum and just check it wherever
necessary. And keep the MIGRATION_STATUS exclusive for the external API,
in which case, it's probably better to just set it unconditionally (in
many places migrate_set_state already takes the current state as
argument, i.e. it doesn't care about the current state).

> Wdyt...?
>
> Thank you.
> ---
>   - Prasad

Reply via email to