Thanks for bringing this up, Daniel. I don't think it makes sense to have a timeout on live migration, but operators should be able to cancel it, just like any other unbounded long-running process. For example, there's no timeout on file transfers, but they need an interface to report progress and to cancel them. That would imply an option to cancel evacuation too.
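
To make the "no timeout, but report progress and allow cancel" idea concrete, here is a minimal sketch of what such an interface could look like. The class and method names (CancellableTask, cancel, progress, run) are hypothetical and are not existing Nova or libvirt APIs; a real migration would poll the hypervisor rather than count work units.

```python
import threading
import time


class CancellableTask:
    """Unbounded long-running job: no timeout, but observable and cancellable.

    Hypothetical illustration only; not Nova or libvirt code.
    """

    def __init__(self, total_units):
        self.total_units = total_units
        self.done_units = 0
        self.state = "pending"
        self._cancel = threading.Event()
        self._lock = threading.Lock()

    def cancel(self):
        """Ask the job to stop; safe to call from another thread."""
        self._cancel.set()

    def progress(self):
        """Return the fraction of work completed, between 0.0 and 1.0."""
        with self._lock:
            return self.done_units / self.total_units

    def run(self, step_delay=0.0):
        """Work until finished or cancelled; there is deliberately no deadline."""
        self.state = "running"
        while self.done_units < self.total_units:
            # The cancel flag is checked before each unit of work.
            if self._cancel.is_set():
                self.state = "cancelled"
                return self.state
            time.sleep(step_delay)  # stand-in for transferring one chunk
            with self._lock:
                self.done_units += 1
        self.state = "completed"
        return self.state
```

The job itself never gives up; the decision to stop stays with whoever is watching progress() and can call cancel().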
-- Noel

On Fri, Jan 30, 2015 at 8:47 AM, Daniel P. Berrange <berra...@redhat.com> wrote:
> In working on a recent Nova migration bug
>
>   https://bugs.launchpad.net/nova/+bug/1414065
>
> I had cause to refactor the way the nova libvirt driver monitors live
> migration completion/failure/progress. This refactor has opened the
> door for doing more intelligent active management of the live migration
> process.
>
> As it stands today, we launch live migration, with a possible bandwidth
> limit applied and just pray that it succeeds eventually. It might take
> until the end of the universe and we'll happily wait that long. This is
> pretty dumb really and I think we really ought to do better. The problem
> is that I'm not really sure what "better" should mean, except for ensuring
> it doesn't run forever.
>
> As a demo, I pushed a quick proof of concept showing how we could easily
> just abort live migration after say 10 minutes
>
>   https://review.openstack.org/#/c/151665/
>
> There are a number of possible things to consider though...
>
> First how to detect when live migration isn't going to succeed.
>
>  - Could do a crude timeout, eg allow 10 minutes to succeed or else.
>
>  - Look at data transfer stats (memory transferred, memory remaining to
>    transfer, disk transferred, disk remaining to transfer) to determine
>    if it is making forward progress.
>
>  - Leave it up to the admin / user to decide if it has gone long enough
>
> The first is easy, while the second is harder but probably more reliable
> and useful for users.
>
> Second is a question of what to do when it looks to be failing
>
>  - Cancel the migration - leave it running on source. Not good if the
>    admin is trying to evacuate a host.
>
>  - Pause the VM - make it complete as non-live migration. Not good if
>    the guest workload doesn't like being paused
>
>  - Increase the bandwidth permitted. There is a built-in rate limit in
>    QEMU overridable via nova.conf. Could argue that the admin should just
>    set their desired limit in nova.conf and be done with it, but perhaps
>    there's a case for increasing it in special circumstances. eg emergency
>    evacuate of host it is better to waste bandwidth & complete the job,
>    but for non-urgent scenarios better to limit bandwidth & accept failure
>    ?
>
>  - Increase the maximum downtime permitted. This is the small time window
>    when the guest switches from source to dest. Too small and it'll never
>    switch, too large and it'll suffer unacceptable interruption.
>
> We could do some of these things automatically based on some policy
> or leave them up to the cloud admin/tenant user via new APIs
>
> Third there's a question of other QEMU features we could make use of to
> stop problems in the first place
>
>  - Auto-converge flag - if you set this QEMU throttles back the CPUs
>    so the guest cannot dirty ram pages as quickly. This is nicer than
>    pausing CPUs altogether, but could still be an issue for guests
>    which have strong performance requirements
>
>  - Page compression flag - if you set this QEMU does compression of
>    pages to reduce data that has to be sent. This is basically trading
>    off network bandwidth vs CPU burn. Probably a win unless you are
>    already highly overcommitted on CPU on the host
>
> Fourth there's a question of whether we should give the tenant user or
> cloud admin further APIs for influencing migration
>
>  - Add an explicit API for cancelling migration ?
>
>  - Add APIs for setting tunables like downtime, bandwidth on the fly ?
>
>  - Or drive some of the tunables like downtime, bandwidth, or policies
>    like cancel vs paused from flavour or image metadata properties ?
>
>  - Allow operations like evacuate to specify a live migration policy
>    eg switch non-live migrate after 5 minutes ?
>
> The current code is so crude and there's a hell of a lot of options we
> can take. I'm just not sure which is the best direction for us to go
> in.
>
> What kind of things would be the biggest win from Operators' or tenants'
> POV ?
>
> Regards,
> Daniel
> --
> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org -o- http://virt-manager.org :|
> |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
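
The second detection option in the quoted message (watching transfer stats for forward progress rather than using a crude timeout) could be sketched roughly as follows. This is only an illustration under stated assumptions: ProgressWatcher, stall_limit and remaining_bytes are hypothetical names, and the "data remaining" samples are assumed to come from whatever migration job statistics the hypervisor exposes. The idea is to track the lowest "remaining" value seen so far and flag a stall only when no new low has been reached for a given period, which tolerates the remaining figure bouncing up while the guest dirties memory.

```python
class ProgressWatcher:
    """Flag a migration as stalled when 'data remaining' stops reaching new lows.

    Hypothetical sketch; the caller supplies samples and timestamps.
    """

    def __init__(self, stall_limit):
        self.stall_limit = stall_limit  # seconds allowed without a new low
        self.low_water = None           # smallest remaining_bytes seen so far
        self.last_progress_at = None    # time of the most recent new low

    def update(self, remaining_bytes, now):
        """Record one sample; return True if the migration looks stalled."""
        if self.low_water is None or remaining_bytes < self.low_water:
            # Forward progress: remember the new low and reset the clock.
            self.low_water = remaining_bytes
            self.last_progress_at = now
            return False
        # No new low: stalled only once the grace period has elapsed.
        return (now - self.last_progress_at) >= self.stall_limit
```

What to do once update() returns True (cancel, pause, raise bandwidth or downtime) is a separate policy choice, and could still be left to the operator.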