On Tue, Jan 06, 2026 at 04:19:44PM +0530, Prasad Pandit wrote:
> Hello Peter, Fabiano,
> 
> * Thank you for the comments.
> 
> On Tue, 23 Dec 2025 at 20:47, Peter Xu <[email protected]> wrote:
> > I raised a request while we were discussing internally; I didn't see
> > it addressed here, so I'll raise it again:
> >
> > Would you please list where you decided to switch from FAILED -> FAILING
> > and where you decided not to, with a justification for each?
> >
> > Let me give a detailed example from this patch; please see below.
> > This is the first real change where we switch to FAILING, namely when
> > migration_connect_set_error() is invoked and migration has failed.
> >
> > Please justify why setting FAILING is correct here.
> >
> > This function is invoked from three call sites:
> >
> > qmp_migrate[2302]              migration_connect_set_error(s, local_err);
> > qmp_migrate_finish[2347]       migration_connect_set_error(s, local_err);
> > migration_connect[4047]        migration_connect_set_error(s, error_in);
> >
> > At least in the first two callers, I don't see migration_cleanup()
> > invoked after setting FAILING.  Could this cause migration to get stuck
> > in the FAILING status forever, without ever moving to FAILED?
> 
> * Ah, sorry about this; I missed removing this change and the
> qmp_migrate_finish() one from this patch. The idea was to switch to the
> interim FAILING state wherever execution follows the
> migration_iteration_finish() or bg_migration_iteration_finish() path
> and leads to migration_cleanup() setting the final state to FAILED. (I
> create and test patches on a pair of machines and then copy them to the
> local repository on my laptop to send upstream; I cannot run git
> send-email from the test machines, so such discrepancies fall through
> at times.)
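> 
> * Roughly, the intended flow is (a sketch of the call path only; the
> helper names are from the current code, exact guard conditions elided):
> 
>       migration_thread()                              /* error detected */
>         -> migrate_set_state(... MIGRATION_STATUS_FAILING)  /* interim */
>         -> migration_iteration_finish()
>              -> migration_cleanup()               /* FAILING -> FAILED */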
> 
> * As you requested, please see below the list of call sites where the
> change was applied and where it was not (a minimal standalone model of
> the intended handoff follows the list). The justification for changing
> to the interim FAILING state is that execution follows through to the
> migration_cleanup() path; the justification for not changing the state
> is that execution does not follow through to the migration_cleanup()
> path.
> ===
> 01  void migration_channel_process_incoming(QIOChannel *ioc)
>       ...
>         error_report_err(local_err);
>         migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> 
> 02  static void cpr_exec_cb(void *opaque)
>       ...
>         error_report_err(error_copy(err));
>         migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> 
> 03  static void coroutine_fn process_incoming_migration_co(void *opaque)
>       ...
>       fail:
>          migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
>                            MIGRATION_STATUS_FAILED);
>          migrate_error_propagate(s, local_err);
> 
> 04  static void migration_connect_error_propagate(MigrationState *s,
>                                                    Error *error)
>       ...
>         case MIGRATION_STATUS_SETUP:
>             next = MIGRATION_STATUS_FAILED;
>             break;
>         migrate_error_propagate(s, error);
> 
> 05  static void qmp_migrate_finish(MigrationAddress *addr, ...)
>       ...
>         else {
>             error_setg(&local_err, "uri is not a valid migration protocol");
>             migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                               MIGRATION_STATUS_FAILED);
> 
> 06  static void multifd_recv_terminate_threads(Error *err)
>       ...
>         if (s->state == MIGRATION_STATUS_SETUP ||
>             s->state == MIGRATION_STATUS_ACTIVE) {
>             migrate_set_state(&s->state, s->state,
>                               MIGRATION_STATUS_FAILED);
> 
> 07  static void *postcopy_listen_thread(void *opaque)
>       ...
>         error_report_err(local_err);
>         migrate_set_state(&mis->state, mis->state, MIGRATION_STATUS_FAILED);
>         goto out;
> 
> 08  static int qemu_savevm_state(QEMUFile *f, Error **errp)
>       ...
>       cleanup:
>         if (ret != 0) {
>           status = MIGRATION_STATUS_FAILED;
>         } else {
>           status = MIGRATION_STATUS_COMPLETED;
>         }
>         migrate_set_state(&ms->state, MIGRATION_STATUS_SETUP, status);
> 
> * The MIGRATION_STATUS_FAILED instances above are _not_ changed because
> they do not follow the migration_cleanup() path. The instances below
> are changed to the interim FAILING state because execution follows
> through the migration_cleanup() path, wherein the interim state is
> changed to the final FAILED state.
> 
> 09  static int postcopy_start(MigrationState *ms, Error **errp)
>       ...
>         if (ms->state != MIGRATION_STATUS_CANCELLING) {
>            migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
> 
>       ...
>       fail:
>         if (ms->state != MIGRATION_STATUS_CANCELLING) {
>             migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
>         }
> 
> 10  static void migration_completion(MigrationState *s)
>       ...
>       fail:
>         if (s->state != MIGRATION_STATUS_CANCELLING) {
>             migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>         }
> 
> 11  static void bg_migration_completion(MigrationState *s)
>       ...
>       fail:
>         migrate_set_state(&s->state, current_active_state,
>                           MIGRATION_STATUS_FAILED);
> 
> 12  static MigThrError migration_detect_error(MigrationState *s)
>       ...
>         migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
>         /* Time to stop the migration, now. */
>         return MIG_THR_ERR_FATAL;
> 
> 13  static void *migration_thread(void *opaque)
>       ...
>         migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
>                           MIGRATION_STATUS_FAILED);
>         goto out;
> 
> 14  static void *bg_migration_thread(void *opaque)
>       ...
>         migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
>                           MIGRATION_STATUS_FAILED);
>         goto fail_setup;
> 
>       fail:
>         if (early_fail) {
>             migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
>                               MIGRATION_STATUS_FAILED);
>         }
>       fail_setup:
>         bg_migration_iteration_finish(s);
> 
> 15  void migration_connect(MigrationState *s, Error *error_in)
>       ...
>         if (s->state != MIGRATION_STATUS_CANCELLING) {
>             migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>         }
>         migration_cleanup(s);
> 
> 16  static void multifd_send_error_propagate(Error *err)
>       ...
>         s->state == MIGRATION_STATUS_DEVICE ||
>         s->state == MIGRATION_STATUS_ACTIVE) {
>             migrate_set_state(&s->state, s->state,
>                               MIGRATION_STATUS_FAILED);
> 
> 17  bool multifd_send_setup(void)
>       ...
>       err:
>         migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                       MIGRATION_STATUS_FAILED);
>       return false;
> ===
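> 
> * To make the intended handoff concrete, below is a minimal standalone
> model (this is not QEMU code: the names mirror the real helpers, the
> real migrate_set_state() uses qatomic_cmpxchg() rather than a plain
> compare, and main() is only a driver for the sketch):
> ===
> typedef enum {
>     MIGRATION_STATUS_ACTIVE,
>     MIGRATION_STATUS_FAILING,  /* interim: error seen, cleanup pending */
>     MIGRATION_STATUS_FAILED,   /* final: set only by the cleanup path */
> } MigrationStatus;
> 
> typedef struct { MigrationStatus state; } MigrationState;
> 
> static void migrate_set_state(MigrationStatus *state,
>                               MigrationStatus old_state,
>                               MigrationStatus new_state)
> {
>     if (*state == old_state) {  /* qatomic_cmpxchg() in the real code */
>         *state = new_state;
>     }
> }
> 
> /* Only the cleanup path finalizes the interim state into FAILED. */
> static void migration_cleanup(MigrationState *s)
> {
>     migrate_set_state(&s->state, MIGRATION_STATUS_FAILING,
>                       MIGRATION_STATUS_FAILED);
> }
> 
> int main(void)
> {
>     MigrationState s = { MIGRATION_STATUS_ACTIVE };
> 
>     /* An error path that is known to reach migration_cleanup() sets
>      * the interim state; cleanup then finalizes it to FAILED. */
>     migrate_set_state(&s.state, s.state, MIGRATION_STATUS_FAILING);
>     migration_cleanup(&s);
> 
>     return s.state == MIGRATION_STATUS_FAILED ? 0 : 1;
> }
> ===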

Yes, something like this would be more than welcome, thanks.  You can
provide a simplified version of it in the commit log when you repost.

Note that I'm not reading each of them carefully, because we have recent
concurrent changes from Fabiano; essentially, they will change where
migration_cleanup() is used in the examples above:

https://lore.kernel.org/r/[email protected]

So we'll need to be careful when landing these two changes.

Considering that the other series is close to landing, IMHO it might be a
good idea to base your patches (when reposted) on that series.  Please
have a look.

> 
> * As proposed earlier, I'd request dividing this patchset into two (or
> even more) patches:
>     - One patch to fix the race condition at hand, by introducing the
> interim FAILING state.
>     - One or more patches to use the interim FAILING state where we
> currently set the FAILED state.
> That should help us make, test, and review these changes. The way this
> state change is done and handled is complicated, and IIUC the migration
> code serves multiple use cases: live VM migration between two machines,
> taking snapshots of a VM, and migrating a VM to a new QEMU process on
> the same host. Doing such a change in a single large patch seems risky.

I still don't think FAILING is such a huge change that it needs to be
split into multiple patches.  It's a new status, and we need to review
every spot where the FAILED status is set, in which case a single patch
is very much self-contained.

Even for a backport, I think we should backport all the relevant FAILING
changes when necessary.  We should not backport only part of them,
causing the FAILING status to behave differently across different paths.

Thanks,

-- 
Peter Xu

