[openstack-dev] [Nova] Live Migration: Austin summit update

2016-04-29 Thread Murray, Paul (HP Cloud)
The following summarizes status of the main topics relating to live migration 
after the Newton design summit. Please feel free to correct any inaccuracies or 
add additional information.

Paul

-

Libvirt storage pools

The storage pools work has been selected as one of the project review 
priorities for Newton.
(see https://etherpad.openstack.org/p/newton-nova-summit-priorities )

Continuation of the libvirt storage pools work was discussed in the live 
migration session. The proposal has grown to include a refactor of the existing 
libvirt driver instance storage code. Justification for this is based on three 
factors:

1.   The code needs to be refactored to use storage pools

2.   The code is complicated and relies on inspection, which is poor practice

3.   During the investigation Matt Booth discovered two CVEs in the code, 
suggesting further work is justified

So the proposal is now to follow three stages:

1.   Refactor the instance storage code

2.   Adapt to use storage pools for the instance storage

3.   Use storage pools to drive resize/migration

Matt has code already starting the refactor and will continue with help from 
Paul Carlton + Paul Murray. We will look for additional contributors to help as 
we plan out the patches.

https://review.openstack.org/#/c/302117 : Persist libvirt instance storage 
metadata
https://review.openstack.org/#/c/310505 : Use libvirt storage pools
https://review.openstack.org/#/c/310538 : Migrate libvirt volumes
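
For anyone not familiar with the libvirt storage pool API the refactor would 
target, here is a rough sketch using the libvirt-python bindings. The pool 
name, path and volume details below are made-up examples, not what the patches 
will actually create:

import libvirt

# Sketch only: define a directory-backed pool over the instances directory
# and allocate an instance disk inside it. Names and paths are examples.
POOL_XML = """
<pool type='dir'>
  <name>nova-instances</name>
  <target><path>/var/lib/nova/instances</path></target>
</pool>
"""

VOLUME_XML = """
<volume>
  <name>instance-0001-disk</name>
  <capacity unit='G'>20</capacity>
  <target><format type='qcow2'/></target>
</volume>
"""

conn = libvirt.open('qemu:///system')
pool = conn.storagePoolDefineXML(POOL_XML, 0)   # persistent pool definition
pool.setAutostart(1)
pool.create(0)                                  # activate the pool

vol = pool.createXML(VOLUME_XML, 0)             # libvirt allocates and tracks the disk
print(vol.path())
conn.close()

The attraction is that libvirt then keeps the volume metadata itself, so the 
driver no longer has to inspect files on disk to work out what they are.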

Post copy

The spec to add post copy migration support in the libvirt driver was discussed 
in the live migration session. Post copy guarantees completion of a migration 
in linear time without needing to pause the VM. This can be used as an 
alternative to pausing in live-migration-force-complete. Pause or complete 
could also be invoked automatically under some circumstances. The issue slowing 
these specs is how to decide which method to use: they provide different user 
experiences, but we don't want to expose virt-specific features in the API. Two 
additional specs listed below suggest possible generic ways to address the 
issue.

No conclusions were reached in the session, so the debate will continue on the 
specs. The first link below is the main spec for the feature.

https://review.openstack.org/#/c/301509 : Adds post-copy live migration support 
to Nova
https://review.openstack.org/#/c/305425 : Define instance availability profiles
https://review.openstack.org/#/c/306561 : Automatic Live Migration Completion
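
To make the mechanics concrete, this is roughly how post copy is driven through 
libvirt-python (libvirt >= 1.3.3 with QEMU >= 2.5); the domain name and 
destination URI are made up, and the spec may wire this up differently:

import threading
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0001')        # example domain name

flags = (libvirt.VIR_MIGRATE_LIVE |
         libvirt.VIR_MIGRATE_PEER2PEER |
         libvirt.VIR_MIGRATE_POSTCOPY)          # allow switching to post-copy later

# The migration starts out as a normal pre-copy migration. It runs in a
# worker thread here because migrateToURI3 blocks until the migration ends.
worker = threading.Thread(
    target=dom.migrateToURI3,
    args=('qemu+tcp://dest-host/system', {}, flags))
worker.start()

# If the migration is not converging, switching to post-copy guarantees
# completion without pausing the guest.
dom.migrateStartPostCopy(0)
worker.join()

In that model live-migration-force-complete would call migrateStartPostCopy() 
instead of pausing the guest, which is the difference in user experience the 
specs above are debating how to expose.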

Live Migration orchestrated via conductor

The proposal to move orchestration of live migration to conductor was discussed 
in the working session on Friday, presented by Andrew Laski on behalf of 
Timofey Durakov. This one generated a lot of debate both for and against the 
general idea, but there was no support for the patches that have been submitted 
along with the spec so far. The general feeling was that we need to tackle 
this, but should take some simple cleanup steps first to get a better idea of 
the problem. Dan Smith proposed, as a first step, moving the stateless 
pre-migration steps to a sequence of calls from conductor (as opposed to the 
current back and forth between computes).

https://review.openstack.org/#/c/292271 : Remove compute-compute communication 
in live-migration
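
To illustrate Dan's suggestion (as a sketch only, not the submitted patches): 
the stateless pre-migration checks become a linear sequence of RPC calls driven 
from conductor. The method names below mirror the existing compute RPC API, but 
the signatures are simplified and the task class itself is hypothetical:

# Sketch only: conductor drives the stateless pre-migration steps in order,
# rather than the source and destination computes calling back and forth.
class LiveMigrationTask(object):
    def __init__(self, compute_rpcapi, instance, destination):
        self.compute_rpcapi = compute_rpcapi
        self.instance = instance
        self.destination = destination

    def _check_requirements(self, context):
        # 1. Ask the destination compute whether it can accept the instance.
        dest_check_data = self.compute_rpcapi.check_can_live_migrate_destination(
            context, self.instance, self.destination)
        # 2. Ask the source compute whether it can migrate there, passing
        #    along the destination's answer.
        return self.compute_rpcapi.check_can_live_migrate_source(
            context, self.instance, dest_check_data)

    def execute(self, context):
        migrate_data = self._check_requirements(context)
        # 3. Only then kick off the actual live migration on the source.
        self.compute_rpcapi.live_migration(
            context, self.instance, self.destination, migrate_data)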

Cold and Live Migration Scheduling

When this patch merges, all migrations will use the request spec for scheduling: 
https://review.openstack.org/#/c/284974
Work is still ongoing for check destinations (allowing the scheduler to check a 
destination chosen by the admin). When that is complete, migrations can be 
placed in three ways:

1.   Destination chosen by scheduler

2.   Destination chosen by admin but checked by scheduler

3.   Destination forced by admin

https://review.openstack.org/#/c/296408 : Re-proposes to check destination on 
migrations
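
As a rough illustration of the three modes (not the actual code: the scheduler 
client and request spec handling are simplified, and the requested_destination 
field is an assumption about how the check destinations work will land):

# Illustration only: how the three placement modes could branch in conductor.
def pick_destination(scheduler_client, context, request_spec,
                     host=None, force=False):
    if host is None:
        # 1. No host supplied: the scheduler chooses the destination.
        return scheduler_client.select_destinations(context, request_spec)[0]
    if force:
        # 3. Host supplied and forced: bypass the scheduler entirely.
        return host
    # 2. Host supplied but not forced: constrain the request spec to that
    #    host so the scheduler verifies the admin's choice before the
    #    migration proceeds.
    request_spec.requested_destination = host
    return scheduler_client.select_destinations(context, request_spec)[0]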

PCI + NUMA claims

Moshe and Jay are making great progress refactoring Nicola's patches to fix PCI 
and NUMA handling in migrations. The patch series should be completed soon.



Re: [openstack-dev] [Nova] Live Migration: Austin summit update

2016-04-29 Thread Matt Riedemann

On 4/29/2016 5:32 PM, Murray, Paul (HP Cloud) wrote:

The following summarizes status of the main topics relating to live
migration after the Newton design summit. Please feel free to correct
any inaccuracies or add additional information.



Paul



-



Libvirt storage pools



The storage pools work has been selected as one of the project review
priorities for Newton.

(see https://etherpad.openstack.org/p/newton-nova-summit-priorities )



Continuation of the libvirt storage pools work was discussed in the live
migration session. The proposal has grown to include a refactor of the
existing libvirt driver instance storage code. Justification for this is
based on three factors:

1.   The code needs to be refactored to use storage pools

2.   The code is complicated and uses inspection, poor practice

3.   During the investigation Matt Booth discovered two CVEs in the
code – suggesting further work is justified



So the proposal is now to follow three stages:

1.   Refactor the instance storage code

2.   Adapt to use storage pools for the instance storage

3.   Use storage pools to drive resize/migration


We also talked about the need for some additional test coverage for the 
refactor work:


1. A job that uses LVM on the experimental queue.

2. ploop should be covered by the Virtuozzo Compute third party CI, but we'll 
need to double-check the test coverage there (is it running the tests that hit 
the code paths being refactored?). Note that they have their own blueprint for 
implementing resize for ploop:


https://blueprints.launchpad.net/nova/+spec/virtuozzo-instance-resize-support

3. Ceph testing - we already have a single-node job for Ceph that will 
test the resize paths. We should also be testing Ceph-backed live 
migration in the special live-migration job that Timofey has been 
working on.


4. NFS testing - this also falls into the special live migration CI job 
that will test live migration in different storage configurations within 
a single run.






Re: [openstack-dev] [Nova] Live Migration: Austin summit update

2016-04-30 Thread Murray, Paul (HP Cloud)

Thanks Matt, I meant to cover CI but clearly omitted it. 



Re: [openstack-dev] [Nova] Live Migration: Austin summit update

2016-05-03 Thread Daniel P. Berrange
On Fri, Apr 29, 2016 at 10:32:09PM +, Murray, Paul (HP Cloud) wrote:
> The following summarizes status of the main topics relating to live migration
> after the Newton design summit. Please feel free to correct any inaccuracies
> or add additional information.

> Post copy
> 
> The spec to add post copy migration support in the libvirt driver was
> discussed in the live migration session. Post copy guarantees completion
> of a migration in linear time without needing to pause the VM. This can
> be used as an alternative to pausing in live-migration-force-complete.
> Pause or complete could also be invoked automatically under some
> circumstances. The issue slowing these specs is how to decide which
> method to use given they provide a different user experience but we
> don't want to expose virt specific features in the API. Two additional
> specs listed below suggest possible generic ways to address the issue.
> 
> There was no conclusions reached in the session so the debate will
> continue on the specs. The first below is the main spec for the feature.
> 
> https://review.openstack.org/#/c/301509 : Adds post-copy live migration 
> support to Nova
> https://review.openstack.org/#/c/305425 : Define instance availability 
> profiles
> https://review.openstack.org/#/c/306561 : Automatic Live Migration Completion

There are currently many options for live migration with QEMU that can
assist in completion:

 - Pause the VM
 - Auto-converge
 - XBZRLE compression
 - Multi-thread compression
 - Post-copy

These are combined with tunables such as max-bandwidth and max-downtime. It is
absolutely clear as mud which of these works best for ensuring completion,
and what kind of impact they have on guest performance.
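
For reference, this is roughly how those knobs surface through libvirt-python; 
the values are arbitrary examples and the multi-thread compression parameters 
are left out:

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0001')          # example domain

# Flags passed to the migrate call (e.g. migrateToURI3) when it starts:
flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PEER2PEER
flags |= libvirt.VIR_MIGRATE_AUTO_CONVERGE        # throttle guest CPU to aid convergence
flags |= libvirt.VIR_MIGRATE_COMPRESSED           # enable migration compression
flags |= libvirt.VIR_MIGRATE_POSTCOPY             # allow flipping to post-copy later

# Tunables, adjustable while the migration is running:
dom.migrateSetMaxSpeed(100, 0)                    # max-bandwidth, MiB/s
dom.migrateSetMaxDowntime(500, 0)                 # max-downtime, milliseconds
dom.migrateSetCompressionCache(64 * 1024 * 1024, 0)  # XBZRLE cache size, bytes

# Pausing the guest is simply dom.suspend() on the source, which drops the
# dirty rate to zero so the remaining pages can be transferred.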

Given this I've spent the last week creating an automated test harness
for QEMU upstream which triggers migration with an extreme guest CPU
load and measures the performance impact of these features on the guest,
and whether the migration actually completes.

I hope to be able to publish the results of this investigation this week
which should facilitate us in deciding which is best to use for OpenStack.
The spoiler though is that all the options are pretty terrible, except for
post-copy.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [Nova] Live Migration: Austin summit update

2016-05-03 Thread Chris Friesen

On 05/03/2016 03:14 AM, Daniel P. Berrange wrote:


There are currently many options for live migration with QEMU that can
assist in completion





Given this I've spent the last week creating an automated test harness
for QEMU upstream which triggers migration with an extreme guest CPU
load and measures the performance impact of these features on the guest,
and whether the migration actually completes.

I hope to be able to publish the results of this investigation this week
which should facilitate us in deciding which is best to use for OpenStack.
The spoiler though is that all the options are pretty terrible, except for
post-copy.


Just to be clear, it's not really CPU load that's the issue though, right?

Presumably it would be more accurate to say that the issue is the rate at which 
unique memory pages are being dirtied and the total number of dirty pages 
relative to your copy bandwidth.


This probably doesn't change the results though: at a high enough dirty rate 
you either pause the VM to keep it from dirtying more memory, or you post-copy 
migrate and dirty the memory on the destination.
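
A back-of-the-envelope way to see it (all numbers made up): each pre-copy pass 
has to transfer whatever was dirtied during the previous pass, so the dirty set 
only shrinks if the dirty rate stays below the copy bandwidth.

# Toy model of pre-copy convergence; the numbers are illustrative only.
ram_gb = 2.0        # guest RAM to transfer
bandwidth = 1.25    # migration bandwidth in GB/s (~10 Gbit/s)
dirty_rate = 0.2    # GB/s of memory the guest keeps dirtying

remaining = ram_gb
for i in range(1, 31):
    copy_time = remaining / bandwidth     # time spent copying this pass
    remaining = dirty_rate * copy_time    # pages dirtied in the meantime
    if remaining < 0.001:                 # small enough for a short downtime
        print("converged after %d passes" % i)
        break
else:
    print("did not converge in 30 passes")

# If dirty_rate >= bandwidth, 'remaining' never shrinks and pre-copy never
# converges; the only ways out are pausing the guest or post-copy.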


Chris



Re: [openstack-dev] [Nova] Live Migration: Austin summit update

2016-05-04 Thread Daniel P. Berrange
On Tue, May 03, 2016 at 04:16:43PM -0600, Chris Friesen wrote:
> On 05/03/2016 03:14 AM, Daniel P. Berrange wrote:
> 
> >There are currently many options for live migration with QEMU that can
> >assist in completion
> 
> 
> 
> >Given this I've spent the last week creating an automated test harness
> >for QEMU upstream which triggers migration with an extreme guest CPU
> >load and measures the performance impact of these features on the guest,
> >and whether the migration actually completes.
> >
> >I hope to be able to publish the results of this investigation this week
> >which should facilitate us in deciding which is best to use for OpenStack.
> >The spoiler though is that all the options are pretty terrible, except for
> >post-copy.
> 
> Just to be clear, it's not really CPU load that's the issue though, right?
> 
> Presumably it would be more accurate to say that the issue is the rate at
> which unique memory pages are being dirtied and the total number of dirty
> pages relative to your copy bandwidth.
> 
> This probably doesn't change the results though...at a high enough dirty
> rate you either pause the VM to keep it from dirtying more memory or you
> post-copy migrate and dirty the memory on the destination.

Yes, that's correct - I should have been more explicit. A high rate of
dirtying memory implies high CPU load, but high CPU load does not imply a
high rate of dirtying memory. The stress test I used for benchmarking
produces a high rate of dirtying memory.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [Nova] Live Migration: Austin summit update

2016-05-05 Thread Timofei Durakov
We've tested live migration of instances under load for the following features:
 - Auto-converge
 - XBZRLE compression
Tests were done against QEMU 2.5. Instances had 2GB of RAM and the stress tool 
was used to emulate load:

$ stress -m 1 --vm-bytes 200M --vm-keep

Neither of them could be called a silver bullet for the live-migration process.
The load mentioned above made live migration less stable than we wanted.
I've also planned to test all the features Daniel mentioned, and it would be
interesting to compare the results.

Timofey.
