* Wei Wang (wei.w.w...@intel.com) wrote:
> On 02/09/2018 08:15 PM, Dr. David Alan Gilbert wrote:
> > * Wei Wang (wei.w.w...@intel.com) wrote:
> > > This patch adds a timer to limit the time that host waits for the free
> > > page hints reported by the guest. Users can specify the time in ms via
> > > "free-page-wait-time" command line option. If a user doesn't specify a
> > > time, host waits till the guest finishes reporting all the free page
> > > hints. The policy (wait for all the free page hints to be reported or
> > > use a time limit) is determined by the orchestration layer.
> > That's kind of a get-out; but there's at least two problems:
> >    a) With a timeout of 0 (the default) we might hang forever waiting
> >       for the guest; broken guests are just too common, we can't do
> >       that.
> >    b) Even if we were going to do that, you'd have to make sure that
> >       migrate_cancel provided a way out.
> >    c) How does that work during a savevm snapshot or when the guest is
> >       stopped?
> >    d) OK, the timer gives us some safety (except c); but how does the
> >       orchestration layer ever come up with a 'safe' value for it?
> >       Unless we can suggest a safe value that the orchestration layer
> >       can use, or a way they can work it out, then they just won't use
> >       it.
> >
>
> Hi Dave,
>
> Sorry for my late response. Please see below:
>
> a) I think people would just kill the guest if it is broken. We can also
> change the default timeout value, for example 1 second, which is enough for
> the free page reporting.

Remember that many VMs are automatically migrated without there being a
human involved; those VMs might be in the BIOS or Grub or shutting down
at the time of migration; there's no human to look at the VM.

> b) How about changing it this way: if timeout happens, host sends a stop
> command to the guest, and makes virtio_balloon_poll_free_page_hints()
> "return" immediately (without getting the guest's acknowledge). The "return"
> basically goes back to the migration_thread function:
>     while (s->state == MIGRATION_STATUS_ACTIVE ||
>            s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
>         ...
>     }
>
> migration_cancel sets the state to MIGRATION_CANCELLING, so it will stop the
> migration process.

OK, but that does rely on there being a timeout; it means you can't have
the default no-timeout because then you can't cancel.

> c) This optimization needs the guest to report. If the guest is stopped, it
> wouldn't work. How about adding a check of "RUN_STATE" before going into the
> optimization?

Yes, that's OK.

> d) Yes. Normally it is faster to wait for the guest to report all the free
> pages. Probably, we can just hardcode a value (e.g. 1s) for now (instead of
> making it configurable by users), this is used to handle the case that the
> guest is broken. What would you think?

The issue is not about configurability - the issue is that it's
hard/impossible to find a good value for the timeout.

Dave

> Best,
> Wei
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
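
To make the exit conditions from points (a) to (d) concrete, here is a
minimal sketch in C of the check a polling loop could make on each
iteration. It is not QEMU code: hint_wait_check(), HintWaitResult and all
of the parameters are hypothetical names used only for illustration, and
the sketch assumes the caller samples the run state, the migration state
and a monotonic clock itself. The cancel check is deliberately independent
of the deadline, so a cancel would still end the wait even with a very
large timeout. The sketch says nothing about how to pick the timeout value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes for one iteration of the hint-wait loop. */
typedef enum {
    HINT_WAIT_CONTINUE,   /* nothing decided yet, keep polling */
    HINT_WAIT_DONE,       /* guest reported all free page hints */
    HINT_WAIT_TIMEOUT,    /* deadline reached before the guest finished */
    HINT_WAIT_CANCELLED,  /* migration is no longer in an active state */
    HINT_WAIT_SKIPPED     /* guest is not running, so it cannot report */
} HintWaitResult;

/*
 * Decide what the polling loop should do next.
 * All inputs are sampled by the caller (assumption of this sketch):
 *   guest_running    - the run-state check suggested in (c)
 *   migration_active - false once a cancel has changed the state, see (b)
 *   guest_done       - guest has finished reporting hints
 *   now_ms           - current monotonic time in milliseconds
 *   deadline_ms      - start time plus the timeout, in milliseconds
 */
HintWaitResult hint_wait_check(bool guest_running, bool migration_active,
                               bool guest_done, int64_t now_ms,
                               int64_t deadline_ms)
{
    if (!migration_active) {
        return HINT_WAIT_CANCELLED;   /* cancel wins over everything else */
    }
    if (!guest_running) {
        return HINT_WAIT_SKIPPED;     /* stopped guest: skip the optimization */
    }
    if (guest_done) {
        return HINT_WAIT_DONE;
    }
    if (now_ms >= deadline_ms) {
        return HINT_WAIT_TIMEOUT;     /* broken or slow guest: give up */
    }
    return HINT_WAIT_CONTINUE;
}
```

A caller would keep polling while the result is HINT_WAIT_CONTINUE and,
on HINT_WAIT_TIMEOUT, send the stop command to the guest without waiting
for its acknowledgement, as proposed in (b).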