Re: [PATCH] migration: introduce MIGRATION_STATUS_FAILING

Fabiano Rosas Wed, 07 Jan 2026 05:24:21 -0800

Prasad Pandit <[email protected]> writes:

> Hi,
>
> On Tue, 6 Jan 2026 at 19:17, Fabiano Rosas <[email protected]> wrote:
>> If we had a linear state transition table, i.e. a DFA without any
>> branching, that would be ideal. But since we have states that can reach
>> (and be reached from) multiple other states, then we'll always need some
>> input to migration_change_state(). Here you're making it the
>> s->trigger. Where will that come from?
>
> * The trigger or reason can come from the place where we call
> migration_change_state(), there we'll know whether migration has
> paused OR completed OR failed OR cancelled.
>


It would be interesting maybe to restrict the set of
states/triggers/events (I'm not sure) to user-visible phases only, and
those would be defined by anything that's triggered via a QMP
command. Plus the error state which goes in the other direction.

In isolation, having a "trigger" for each QMP command seems like a good
idea to me. It could be just a flag that tells us what is the current
command that's being serviced. Most of migration are actions in response
to a QMP command. This could help with ensuring correctness in
concurrent invocations.

> * Even with branches, the process is still linear as it goes from
> start to finish. Just that we can reach the end state via different
> paths.
>   ===
>     $ grep -ri 'shutting' /var/log/libvirt/qemu/   | cut -d' ' -f 3- |
> sort | uniq
>      shutting down, reason=crashed
>      shutting down, reason=destroyed
>      shutting down, reason=failed
>      shutting down, reason=migrated
>      shutting down, reason=shutdown
> ===
> As we see, guest VM can stop/shutdown due to various reasons.
>
> * Between [migration-start] and [migration-end], we can define
> events/triggers that will lead to the next state. Ex
>
>       - START -> t:connection-established -> ACTIVE

I like this because it forces us to determine more clearly what is the
necessary condition for a state change. This could eventually allow the
abstraction of the qapi_event_send_migration() to a higher
layer. Something like:

void qmp_migrate() {
    t:migrate=true
    
    migration_setup() :: do setup, mig_setup_done=true
    migration_advance_state() :: checks the triggers, changes state and
                                 sends the event

    migration_start() :: migrate, mig_done=true
                         failure, mig_failed=true
                         etc
    migration_advance_state()

    migrate_vmstate() :: device state migration, mig_device_done=true
    migration_advance_state()

 etc..
}

IOW, we could do a better job of separating what is work, what is
migration control flow, what is error handling, etc.

>
> We can reach the ACTIVE state only after connections are established,
> not without that. If connection establishment fails, we reach the END.
>
>      - START  -> t:connection-established -> ACTIVE ->  running   -> END
>
> ACTIVE ->  t:error     ->  END
>
> ACTIVE ->  t:cancel  ->  END
>
> ACTIVE ->  t:pause   ->  PAUSED  -> t:resume -> ACTIVE
>

Looks good, not sure if the actual migration flow would fit this, but
let's assume it would.

>> Looking at runstate.c and job.c, it seems we could at least define a
>> state transition table and do away with the second parameter to
>> migrate_set_state(s, old, new).
>>
>> As we've been discussing, the current state-change mechanism has the
>> dual purpose of emitting the state change event and also serving as
>> internal tracking of the migration state. It's not clear to me whether
>> you're covering both in this proposal or just one of them.
>
> * We are not doing away with migration states, just reducing or
> rationalising them to make it easier. Emitting state change to
> libvirtd(8) and internal tracking should still serve the same. Just
> that in migration_is_running() etc. functions we'll check only if the
> state is ACTIVE, instead of 10 other states which also indicate that
> the migration is running.
>

What I'm trying to convey is that we have:

1) events API that needs to be kept stable, this list of states that
   libvirt sees and at what moments we emit them.

2) MigrationStatus being used as an internal record of the current
   (loosely defined) migration phase. This is "arbitrary", hence we're
   discussing adding a new MigrationStatus "just" to make sure we don't
   start a new migration at the wrong moment.

I'm trying to understand if you want to cover 1, 2 or both.

I would suggest we first take all of the internal tracking, i.e. #2, the
"if (state==MIGRATION_STATUS)" code and convert them to use some other
state tracking, either the triggers as you suggest, or random booleans
sprinkled all over, it's not immediately important.

Once that is done, then we could freeze the #1, MigrationStatus. It
would only change whenever we wanted to change the API and that should
be a well documented change.

>> I don't think we've established actually what are the goals of having
>> any state changes. Do we even need state changes for internal tracking?
>> We could use your s->trigger as an enum and just check it wherever
>> necessary. And keep the MIGRATION_STATUS exclusive for the external API,
>> in which case, it's probably better to just set it unconditionally (in
>> many places migrate_set_state already takes the current state as
>> argument, i.e. it doesn't care about the current state).
>
> * Well as I see it, different states help us to
>       1 - know where the process is at a given time. In case of
> errors/failures or other events to know what actions to take.
>       2 - what actions/triggers/events are possible.
>
> ex. If an error/cancel occurs before ACTIVE state, during connection
> establishment, it may not have to go through migration_cleanup(),
> probably there's nothing to cleanup. Vs if an error/cancel occurs
> after ACTIVE  or in PAUSED state, we know migration_cleanup() is
> needed.  Similarly if we receive t:resume command when in ACTIVE
> state, OR receive t:pause command in PAUSED state,  we know there's
> nothing to do and ignore it.
>

Ok, maybe I'm splittling hairs here, I was trying to understand whether
all of these "if (s->state ...)" have the same semantics.

a) For cases such as CANCELLING: that could be a simple
   s->trigger[MIGRATE_CANCEL]=1.

  (we're not removing the CANCELLING state due to the API stability, but
  still)

b) For error conditions: s->event[FAILED]=1, then (possibly at a later
   point in migration_change_state):

   if (s->event[FAILED] && !s->trigger[MIGRATE_CANCEL]) {
      migrate_set_state(s->state, MIGRATION_STATUS_FAILED);
   }

b) For postcopy resume/pause, etc, maybe an actual state machine that can
   only be in one state would be helpful.

c) For "we reached this point, so set this state", most of those could
   just be an invocation to migration_change_state() and, as you
   suggest, that would look for the evidence elsewhere to know what
   state to set:

   if (s->trigger[MIGRATE] && s->event[COMPLETED]) {
      migrate_set_state(s->state, MIGRATION_STATUS_COMPLETED);
   }

Re: [PATCH] migration: introduce MIGRATION_STATUS_FAILING

Reply via email to