Hello Peter, Fabiano,

* Thank you for the comments.

On Tue, 23 Dec 2025 at 20:47, Peter Xu <[email protected]> wrote:
> I raised a request while I was discussing with you internally, I didn't see
> this, I will request again:
>
> Would you please list where you decided to switch from FAILED -> FAILING,
> and where you decided not, with justifications for each of them?
>
> Let me give a detailed example in this patch, please see below.
> This is the first real change that we'll switch to FAILING when
> migration_connect_set_error() is invoked and migration failed.
>
> Please justify why setting FAILING is correct here.
>
> This function is invoked in three callers:
>
> qmp_migrate[2302]              migration_connect_set_error(s, local_err);
> qmp_migrate_finish[2347]       migration_connect_set_error(s, local_err);
> migration_connect[4047]        migration_connect_set_error(s, error_in);
>
> At least from the initial two callers, I don't see migration_cleanup()
> invoked after setting FAILING.  Could this cause migration to get into
> FAILING status forever without finally move to FAILED?

* Ah, sorry about this; I missed removing this and
qmp_migrate_finish() change from this patch. Idea was to switch to the
interim FAILING state wherever execution follows the
migration_iteration_finish() OR bg_migration_iteration_finish() path
and leads to migration_cleanup() setting the final state to FAILED. (I
create/test patches on a pair of machines and then copy them to the
local repository on my laptop to send upstream, can not do git
send-email from test machines, these discrepancies fall through at
times)

* As you requested, please see below the list of calls where the
change was applied and where it was not. Justification for changing to
the interim 'FAILING' state is that execution follows through to
migration_cleanup() path. And Justification for not changing the state
is that execution does not follow through to the migration_cleanup()
path.
===
01  void migration_channel_process_incoming(QIOChannel *ioc)
      ...
        error_report_err(local_err);
        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);

02  static void cpr_exec_cb(void *opaque)
      ...
        error_report_err(error_copy(err));
        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);

03  static void coroutine_fn process_incoming_migration_co(void *opaque)
      ...
      fail:
         migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
                           MIGRATION_STATUS_FAILED);
         migrate_error_propagate(s, local_err);

04  static void migration_connect_error_propagate(MigrationState *s,
Error *error)
      ...
        case MIGRATION_STATUS_SETUP:
            next = MIGRATION_STATUS_FAILED;
            break;
        migrate_error_propagate(s, error);

05  static void qmp_migrate_finish(MigrationAddress *addr, ...)
      ...
        else {
            error_setg(&local_err, "uri is not a valid migration protocol");
            migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
                              MIGRATION_STATUS_FAILED);

06  static void multifd_recv_terminate_threads(Error *err)
      ...
        if (s->state == MIGRATION_STATUS_SETUP ||
            s->state == MIGRATION_STATUS_ACTIVE) {
            migrate_set_state(&s->state, s->state,
                              MIGRATION_STATUS_FAILED);

07  static void *postcopy_listen_thread(void *opaque)
      ...
        error_report_err(local_err);
        migrate_set_state(&mis->state, mis->state, MIGRATION_STATUS_FAILED);
        goto out;

08  static int qemu_savevm_state(QEMUFile *f, Error **errp)
      ...
      cleanup:
        if (ret != 0) {
          status = MIGRATION_STATUS_FAILED;
        } else {
          status = MIGRATION_STATUS_COMPLETED;
        }
        migrate_set_state(&ms->state, MIGRATION_STATUS_SETUP, status);

* Above MIGRATION_STATUS_FAILED instances are _not_ changed because
they do not follow the migration_cleanup() path. The instances below
are changed to the interim FAILING state because execution follows
through the migration_cleanup() path, wherein the interim state
changes to the FAILED state.

09  static int postcopy_start(MigrationState *ms, Error **errp)
      ...
        if (ms->state != MIGRATION_STATUS_CANCELLING) {
           migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);

      ...
      fail:
        if (ms->state != MIGRATION_STATUS_CANCELLING) {
            migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
        }

10  static void migration_completion(MigrationState *s)
      ...
      fail:
        if (s->state != MIGRATION_STATUS_CANCELLING) {
            migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
        }

11  static void bg_migration_completion(MigrationState *s)
      ...
      fail:
        migrate_set_state(&s->state, current_active_state,
MIGRATION_STATUS_FAILED);

12  static MigThrError migration_detect_error(MigrationState *s)
      ...
        migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
        /* Time to stop the migration, now. */
        return MIG_THR_ERR_FATAL;

13  static void *migration_thread(void *opaque)
      ...
        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
MIGRATION_STATUS_FAILED);
        goto out;

14  static void *bg_migration_thread(void *opaque)
      ...
        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
MIGRATION_STATUS_FAILED);
        goto fail_setup;

      fail:
        if (early_fail) {
            migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
MIGRATION_STATUS_FAILED);
        }
      fail_setup:
        bg_migration_iteration_finish(s);

15  void migration_connect(MigrationState *s, Error *error_in)
      ...
        if (s->state != MIGRATION_STATUS_CANCELLING) {
            migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
        }
        migration_cleanup(s);

16  static void multifd_send_error_propagate(Error *err)
      ...
        s->state == MIGRATION_STATUS_DEVICE ||
        s->state == MIGRATION_STATUS_ACTIVE) {
            migrate_set_state(&s->state, s->state,
                              MIGRATION_STATUS_FAILED);

17  bool multifd_send_setup(void)
      ...
      err:
        migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
                      MIGRATION_STATUS_FAILED);
      return false;
===

* As proposed earlier, I'll request to divide this patchset into two
(or even) more patches as:
    - One patch to fix the race condition issue at hand, by way of
introducing the interim FAILING state.
    - One or more patches to use the interim FAILING state, where we
currently set the FAILED state.
That should help us to do/test/review these changes. It is complicated
how this state change is done/handled and IIUC, migration code seems
to serve multiple use cases: one is live VM migration between two
machines, second is taking snapshots of a VM, third is migrating a VM
to a new QEMU process in the same host. Doing such a change in a
single large patch seems risky.

Thank you.
---
  - Prasad


Reply via email to