* Wei Wang (wei.w.w...@intel.com) wrote:
> On 02/09/2018 08:15 PM, Dr. David Alan Gilbert wrote:
> > * Wei Wang (wei.w.w...@intel.com) wrote:
> > > This patch adds a timer to limit the time that host waits for the free
> > > page hints reported by the guest. Users can specify the time in ms via
> > > "free-page-wait-time" command line option. If a user doesn't specify a
> > > time, host waits till the guest finishes reporting all the free page
> > > hints. The policy (wait for all the free page hints to be reported or
> > > use a time limit) is determined by the orchestration layer.
> > That's kind of a get-out; but there's at least two problems:
> >     a) With a timeout of 0 (the default) we might hang forever waiting
> >        for the guest; broken guests are just too common, we can't do
> >        that.
> >     b) Even if we were going to do that, you'd have to make sure that
> >        migrate_cancel provided a way out.
> >     c) How does that work during a savevm snapshot or when the guest is
> >        stopped?
> >     d) OK, the timer gives us some safety (except c); but how does the
> >        orchestration layer ever come up with a 'safe' value for it?
> >        Unless we can suggest a safe value that the orchestration layer
> >        can use, or a way they can work it out, then they just wont use
> >        it.
> > 
> 
> Hi Dave,
> 
> Sorry for my late response. Please see below:
> 
> a) I think people would just kill the guest if it is broken. We can also
> change the default timeout value, for example 1 second, which is enough for
> the free page reporting.

Remember that many VMs are automatically migrated without their being a
human involved; those VMs might be in the BIOS or Grub or shutting down at
the time of migration; there's no human to look at the VM.

> b) How about changing it this way: if timeout happens, host sends a stop
> command to the guest, and makes virtio_balloon_poll_free_page_hints()
> "return" immediately (without getting the guest's acknowledge). The "return"
> basically goes back to the migration_thread function:
> while (s->state == MIGRATION_STATUS_ACTIVE ||
>            s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> ...
> }
> 
> migration_cancel sets the state to MIGRATION_CANCELLING, so it will stop the
> migration process.

OK, but htat does rely on there being a timeout; it means you can't have
the default no-timeout because then you can't cancel.

> c) This optimization needs the guest to report. If the guest is stopped, it
> wouldn't work. How about adding a check of "RUN_STATE" before going into the
> optimization?

Yes, that's OK.

> d) Yes. Normally it is faster to wait for the guest to report all the free
> pages. Probably, we can just hardcode a value (e.g. 1s) for now (instead of
> making it configurable by users), this is used to handle the case that the
> guest is broken. What would you think?

The issue is not about configurability - the issue is that it's
hard/impossible to find a good value for the timeout.

Dave

> Best,
> Wei
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Reply via email to