Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp

Dr. David Alan Gilbert Mon, 13 Jun 2022 04:13:27 -0700

* Peter Xu (pet...@redhat.com) wrote:
> On Thu, Jun 09, 2022 at 05:02:29PM -0400, Peter Xu wrote:
> > On Wed, Jun 08, 2022 at 06:05:28PM +0100, Dr. David Alan Gilbert wrote:
> > > > @@ -2005,7 +2005,17 @@ static void loadvm_postcopy_handle_run_bh(void 
> > > > *opaque)
> > > >      /* TODO we should move all of this lot into postcopy_ram.c or a 
> > > > shared code
> > > >       * in migration.c
> > > >       */
> > > > -    cpu_synchronize_all_post_init();
> > > > +    cpu_synchronize_all_post_init(&local_err);
> > > > +    if (local_err) {
> > > > +        /*
> > > > +         * TODO: a better way to do this is to tell the src that we 
> > > > cannot
> > > > +         * run the VM here so hopefully we can keep the VM running on 
> > > > src
> > > > +         * and immediately halt the switch-over.  But that needs work.
> > > 
> > > Yes, I think it is possible; unlike some of the later errors in the same
> > > function, in this case we know no disks/network/etc have been touched,
> > > so we should be able to recover.
> > > I wonder if we can move the postcopy_state_set(POSTCOPY_INCOMING_RUNNING)
> > > out of loadvm_postcopy_handle_run to after this point.
> > > 
> > > We've already got the return path, so we should be able to signal the
> > > failure unless we're very unlucky.
> > 
> > Right.  It's just that for the new ACK we may need to modify the return
> > path protocol for sure, because none of the existing ones can notify such
> > an information.
> > 
> > One idea is to reuse MIG_RP_MSG_RESUME_ACK, it was only used for postcopy
> > recovery before to do the final handshake with offload=1 only (which is
> > defined as MIGRATION_RESUME_ACK_VALUE).  We could try to fill in the
> > payload with some !1 value, to tell the source that we NACK the migration
> > then src fails the migration as long as possible?
> > 
> > That seems to be even compatibile with one old qemu migrating to a new qemu
> > scenario, because when the old qemu notices the MIG_RP_MSG_RESUME_ACK
> > message with !1 payload, it'll mark the rp bad:
> 
> Oh it won't be compatible..  The clean way to do this is we need to modify
> the src qemu to halt in postcopy_start() to wait for that ack before
> continue.  That may need another cap/param to enable.


OK; I was wondering aobut sending a RP_MSG_SHUT with a failure; but if
you'd need to change the source it's still a problem.

> The thing is I'm not very sure whether this will be worth it.
> 
> Non-compatible migrations should be rare on put register failures.  For the
> issue I was working on, it was actually a kernel bug that triggered it but
> it's just hard to figure out where's wrong.  With properly working kernels
> and matching hosts they should just not really heppen.  I'm worried adding
> too much complexity could over-engineer things without much benefits.

OK that makes sense.

> In that case, I'd think it proper if we start with what this patchset
> provides, which at least allows us to fail in a crystal clear way?

Yes, the clear error is important.

Dave

> > 
> >   if (migrate_handle_rp_resume_ack(ms, tmp32)) {
> >       mark_source_rp_bad(ms);
> >       goto out;
> >   }
> > 
> >   static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
> >   {
> >       trace_source_return_path_thread_resume_ack(value);
> >   
> >       if (value != MIGRATION_RESUME_ACK_VALUE) {
> >           error_report("%s: illegal resume_ack value %"PRIu32,
> >                        __func__, value);
> >           return -1;
> >       }
> >       ...
> >   }
> > 
> > If it looks generally good, I can try with such a change in v2.
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp

Reply via email to