On Mon, Nov 23, 2015 at 08:36:32AM +0000, Paul Carlton wrote:
> John
> 
> At the live migration sub team meeting I undertook to look at the issue
> of progress reporting.
> 
> The use cases I'm envisaging are...
> 
> As a user I want to know how much longer my instance will be migrating
> for.
> 
> As an operator I want to identify any migration that are making slow
>  progress so I can expedite their progress or abort them.
> 
> The current implementation reports on the instance's migration with
> respect to memory transfer, using the total memory and memory remaining
> fields from libvirt to report the percentage of memory still to be
> transferred.  Due to the instance writing to pages already transferred
> this percentage can go up as well as down.  Daniel has done a good job
> of generating regular log records to report progress and highlight lack
> of progress but from the API all a user/operator can see is the current
> percentage complete.  By observing this periodically they can identify
> instance migrations that are struggling to migrate memory pages fast
> enough to keep pace with the instance's memory updates.
> 
> The problem is that at present we have only one field, the instance
> progress, to record progress.  With a live migration there are measures
> of progress, how much of the ephemeral disks (not needed for shared
> disk setups) have been copied and how much of the memory has been
> copied. Both can go up and down as the instance writes to pages already
> copied causing those pages to need to be copied again.  As Daniel says
> in his comments in the code, the disk size could dwarf the memory so
> reporting both in single percentage number is problematic.
> 
> We could add an additional progress item to the instance object, i.e.
> disk progress and memory progress but that seems odd to have an
> additional progress field only for this operation so this is probably
> a non starter!
> 
> For operations staff with access to log files we could report disk
> progress as well as memory in the log file, however that does not
> address the needs of users and whilst log files are the right place for
> support staff to look when investigating issues operational tooling
> is much better served by notification messages.
> 
> Thus I'd recommend generating periodic notifications during a migration
> to report both memory and disk progress would be useful?  Cloud
> operators are likely to manage their instance migration activity using
> some orchestration tooling which could consume these notifications and
> deduce what challenges the instance migration is encountering and thus
> determine how to address any issues.
> 
> The use cases are only partially addressed by the current
> implementation, they can repeatedly get the server details and look at
> the progress percentage to see how quickly (or even if) it is
> increasing and determine how long the instance is likely to be
> migrating for.  However for an instance that has a large disk and/or
> is doing a high rate of disk i/o they may see the percentage complete
> (i.e. memory) repeatedly showing 90%+ but the instance migration does
> not complete.
> 
> The nova spec https://review.openstack.org/#/c/248472/ suggests making
> detailed information available via the os-migrations object.  This is
> not a bad idea but I have some issues with the implementation that I
> will share on that spec.

As I mentioned in the spec, I won't support exposing anything other
than disk total + remaining via the API. All the other stats are
low level QEMU specific implementation details that I feel the public
API users have no business knowing about.

In general I think we need to be wary of exposing lots of info + knobs
via the API, as that direction essentially ends up forcing the problem
onto client application. The focus should really be on ensuring that
Nova consumes all these stats exposed by QEMU and makes decisions
itself based on that.

At most an external application should have information on the data
transfer progress. I'm not even convinced that applications should
need to be able to figure out if a live migration is stuck. I generally
think that any scenario in which a live migration can get stuck is a
bug in Nova's management of the migration process. IOW, the focus of
our efforts should be on ensuring Nova does the right thing to guarantee
that live migration will never get stuck. At which point an Nova client
user / application should really only care about the overall progress
of a live migration.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to