Re: [openstack-dev] [nova] Migration progress

2016-02-08 Thread Timofei Durakov
Hi,

For live-migration progress reporting I'd rather go with real-time stats
queried from compute, instead of reporting this data to the db first. While
the number of rpc requests/db updates per migration is relatively small, the
total number of such requests depends on the number of active migrations.
Querying real-time data from compute reduces it: not every migration will
prompt an operator to gather statistics, and each query that does happen
requires only 2 rpc calls per request, instead of 2 rpc calls and a db write
every 3/5/etc. seconds.
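
To put rough numbers on that comparison, here is a minimal sketch. The
function names, the one-hour duration and the 20 operator queries are
assumptions made up for illustration; only the per-request costs (2 rpc on
demand, 2 rpc plus a db write per reporting interval) come from the text
above.

```python
def periodic_cost(duration_s, interval_s, migrations):
    """RPC calls and DB writes when every migration reports on a timer."""
    reports = (duration_s // interval_s) * migrations
    return {"rpc": 2 * reports, "db_writes": reports}


def on_demand_cost(operator_queries):
    """RPC calls when progress is fetched from compute only when asked for."""
    return {"rpc": 2 * operator_queries, "db_writes": 0}


# Ten one-hour migrations reporting every 5 seconds, versus an operator
# who asks for progress 20 times in total (both figures are hypothetical).
print(periodic_cost(3600, 5, 10))
print(on_demand_cost(20))
```

With these assumed figures the periodic model issues 14400 rpc calls and
7200 db writes, while the on-demand model issues 40 rpc calls and no writes.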

Timofey.

On Sun, Feb 7, 2016 at 10:31 PM, Jay Pipes  wrote:

> On 02/04/2016 11:02 PM, Bhandaru, Malini K wrote:
>
>> Another thought, for such ephemeral/changing data, such as progress,
>> why not save the information in the cache (and flush to database at a
>> lower rate), and retrieve for display to active listeners/UI from the
>> cache. Once complete or aborted, of course flush the cache.
>>
>> Also should we provide a "verbose flag", that is only capture
>> progress information when requested? That is when a human user might
>> be issuing the command from the cli or GUI tool.
>>
>
> I agree with you, Malini, on the above suggestion that there is some doubt
> as to the value of saving this temporal data to the database.
>
> Why not just have an on-demand model that simply routes the request for
> progress information directly to the compute node and sends the progress
> amount back directly to the nova-api service instead of going to the
> database at all?
>
> Another alternative would be to use a push model instead of a poll model,
> but that would require a pretty significant change to the code...
>
> Best,
> -jay
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>


Re: [openstack-dev] [nova] Migration progress

2016-02-07 Thread Jay Pipes

On 02/04/2016 11:02 PM, Bhandaru, Malini K wrote:

> Another thought, for such ephemeral/changing data, such as progress,
> why not save the information in the cache (and flush to database at a
> lower rate), and retrieve for display to active listeners/UI from the
> cache. Once complete or aborted, of course flush the cache.
>
> Also should we provide a "verbose flag", that is only capture
> progress information when requested? That is when a human user might
> be issuing the command from the cli or GUI tool.


I agree with you, Malini, on the above suggestion that there is some 
doubt as to the value of saving this temporal data to the database.


Why not just have an on-demand model that simply routes the request for 
progress information directly to the compute node and sends the progress 
amount back directly to the nova-api service instead of going to the 
database at all?
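
A minimal sketch of that on-demand routing, assuming in-process stand-ins
for the two services. The class and method names are hypothetical, and a
plain method call takes the place of the RPC round trip; this is not Nova
code.

```python
class ComputeNode:
    """Stand-in for nova-compute: progress lives only in memory here."""

    def __init__(self):
        self._progress = {}

    def record(self, migration_id, percent):
        # Called by the migration monitor; nothing is written to a database.
        self._progress[migration_id] = percent

    def get_progress(self, migration_id):
        return self._progress.get(migration_id)


class ApiService:
    """Stand-in for nova-api: asks the owning compute node directly."""

    def __init__(self, computes):
        self._computes = computes  # host name -> ComputeNode

    def migration_progress(self, host, migration_id):
        # In a real deployment this would be an RPC call to the host.
        return self._computes[host].get_progress(migration_id)


node = ComputeNode()
node.record("migration-1", 42)
api = ApiService({"compute-1": node})
print(api.migration_progress("compute-1", "migration-1"))
```

The cost of a progress request is then paid only when somebody asks, at the
price of the API having to know which compute host owns the migration.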


Another alternative would be to use a push model instead of a poll 
model, but that would require a pretty significant change to the code...


Best,
-jay



Re: [openstack-dev] [nova] Migration progress

2016-02-04 Thread Zhenyu Zheng
I think we can add a config option for this and set a sensible default
value. We can also add help text to inform the user how an inappropriate
value for this option will affect performance.
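
A minimal sketch of such an option, using only the standard library. The
option name, the default of 5 and the minimum of 1 are hypothetical; they
are not taken from Nova.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntOption:
    name: str
    default: int
    minimum: int
    help: str

    def resolve(self, raw=None):
        """Return the configured value, falling back to the default."""
        value = self.default if raw is None else int(raw)
        if value < self.minimum:
            raise ValueError(
                f"{self.name} must be >= {self.minimum}, got {value}")
        return value


# Hypothetical option: neither the name nor the default comes from Nova.
PROGRESS_INTERVAL = IntOption(
    name="migration_progress_update_interval",
    default=5,
    minimum=1,
    help="Seconds between migration progress writes to the database. "
         "Low values give fresher data but add RPC and DB load for every "
         "running migration; high values make a stuck migration harder "
         "to spot.",
)

print(PROGRESS_INTERVAL.resolve())      # nothing configured: use the default
print(PROGRESS_INTERVAL.resolve("10"))  # value as read from a config file
```

Rejecting values below the minimum at load time is one way to keep an
operator from configuring an interval that would flood the message bus.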



On Wed, Feb 3, 2016 at 7:45 PM, Daniel P. Berrange 
wrote:

> On Wed, Feb 03, 2016 at 11:27:16AM +, Paul Carlton wrote:
> > On 03/02/16 10:49, Daniel P. Berrange wrote:
> > >On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> > >>On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> > >>>Hello everyone,
> > >>>
> > >>>On the yesterday's live migration meeting we had concerns that
> interval of
> > >>>writing migration progress to the database is too short.
> > >>>
> > >>>Information about migration progress will be stored in the database
> and
> > >>>exposed through the API (/servers//migrations/). In current
> > >>>proposition [1] migration progress will be updated every 2 seconds. It
> > >>>basically means that every 2 seconds a call through RPC will go from
> compute
> > >>>to conductor to write migration data to the database. In case of
> parallel
> > >>>live migrations each migration will report progress by itself.
> > >>>
> > >>>Isn't 2 seconds interval too short for updates if the information is
> exposed
> > >>>through the API and it requires RPC and DB call to actually save it
> in the
> > >>>DB?
> > >>>
> > >>>Our default configuration allows only for 1 concurrent live migration
> [2],
> > >>>but it might vary between different deployments and use cases as it is
> > >>>configurable. Someone might want to trigger 10 (or even more)
> parallel live
> > >>>migrations and each might take even a day to finish in case of block
> > >>>migration. Also if deployment is big enough rabbitmq might be
> fully-loaded.
> > >>>I'm not sure whether updating each migration every 2 seconds makes
> sense in
> > >>>this case. On the other hand it might be hard to observe fast enough
> that
> > >>>migration is stuck if we increase this interval...
> > >>Do we have any actual data that this is a real problem. I have a
> pretty hard
> > >>time believing that a database update of a single field every 2
> seconds is
> > >>going to be what pushes Nova over the edge into a performance
> collapse, even
> > >>if there are 20 migrations running in parallel, when you compare it to
> the
> > >>amount of DB queries & updates done across other areas of the code for
> pretty
> > >>much every single API call and background job.
> > >Also note that progress is rounded to the nearest integer. So even if
> the
> > >migration runs all day, there is a maximum of 100 possible changes in
> value
> > >for the progress field, so most of the updates should turn in to no-ops
> at
> > >the database level.
> > >
> > >Regards,
> > >Daniel
> > I agree with Daniel, these rpc and db access ops are a tiny percentage
> > of the overall load on rabbit and mysql and properly configured these
> > subsystems should have no issues with this workload.
> >
> > One correction, unless I'm misreading it, the existing
> > _live_migration_monitor code updates the progress field of the instance
> > record every 5 seconds.  However this value can go up and down so
> > an infinite number of updates are possible?
>
> Oh yes, you are in fact correct. Technically you could have an unbounded
> number of updates if migration goes backwards. Some mitigation against
> this is if we see progress going backwards we'll actually abort the
> migration if it gets stuck for too long. We'll also be progressively
> increasing the permitted downtime. So except in pathological scenarios
> I think the number of updates should still be relatively small.
>
> > However, the issue raised here is not with the existing implementation
> > but with the proposed change
> > https://review.openstack.org/#/c/258813/5/nova/virt/libvirt/driver.py
> > This add a save() operation on the migration object every 2 seconds
>
> Ok, that is more heavy weight since it is recording the raw byte values
> and so it is guaranteed to do a database update pretty much every time.
> It still shouldn't be too unreasonable a loading though. FWIW I think
> it is worth being consistent in the update frequency between the
> progress value & the migration object save, so switching to be every
> 5 seconds probably makes more sense, so we know both objects are
> reflecting the same point in time.
>
> Regards,
> Daniel
> --
> |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org  -o- http://virt-manager.org :|
> |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
>

Re: [openstack-dev] [nova] Migration progress

2016-02-04 Thread Bhandaru, Malini K
I agree with Daniel: keep the two update periods consistent, 5 seconds and 
5 seconds.

Another thought: for ephemeral, changing data such as progress, why not 
save the information in a cache (flushing to the database at a lower rate) and 
serve active listeners/UI from the cache? Once the migration completes or is 
aborted, of course, flush the cache.

Also, should we provide a "verbose" flag, that is, only capture progress 
information when requested? That would be when a human user is issuing the 
command from the CLI or a GUI tool.
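
A minimal sketch of the cache idea, assuming an in-process dictionary as
the cache and a callback as the database write. The class name, the
30-second flush interval and the injected clock are all hypothetical.

```python
import time


class ProgressCache:
    """Keep the latest progress in memory; write to the DB at a lower rate."""

    def __init__(self, db_write, flush_interval_s=30.0, clock=time.monotonic):
        self._db_write = db_write          # callable(migration_id, percent)
        self._flush_interval_s = flush_interval_s
        self._clock = clock
        self._latest = {}
        self._last_flush = {}

    def update(self, migration_id, percent):
        self._latest[migration_id] = percent
        now = self._clock()
        last = self._last_flush.get(migration_id)
        # Write on the first update, then only once per flush interval.
        if last is None or now - last >= self._flush_interval_s:
            self._flush(migration_id, now)

    def get(self, migration_id):
        """Serve readers from memory without touching the database."""
        return self._latest.get(migration_id)

    def finish(self, migration_id):
        """On completion or abort, write the final value and drop the entry."""
        if migration_id in self._latest:
            self._flush(migration_id, self._clock())
            del self._latest[migration_id]
            del self._last_flush[migration_id]

    def _flush(self, migration_id, now):
        self._db_write(migration_id, self._latest[migration_id])
        self._last_flush[migration_id] = now
```

Readers always see the most recent value through get(), while the database
receives at most one write per migration per flush interval plus a final
write from finish().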

Regards
Malini

-----Original Message-----
From: Daniel P. Berrange [mailto:berra...@redhat.com] 
Sent: Wednesday, February 03, 2016 11:46 AM
To: Paul Carlton <paul.carlt...@hpe.com>
Cc: Feng, Shaohe <shaohe.f...@intel.com>; OpenStack Development Mailing List 
(not for usage questions) <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [nova] Migration progress

On Wed, Feb 03, 2016 at 11:27:16AM +, Paul Carlton wrote:
> On 03/02/16 10:49, Daniel P. Berrange wrote:
> >On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> >>On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> >>>Hello everyone,
> >>>
> >>>On the yesterday's live migration meeting we had concerns that 
> >>>interval of writing migration progress to the database is too short.
> >>>
> >>>Information about migration progress will be stored in the database 
> >>>and exposed through the API (/servers//migrations/). In 
> >>>current proposition [1] migration progress will be updated every 2 
> >>>seconds. It basically means that every 2 seconds a call through RPC 
> >>>will go from compute to conductor to write migration data to the 
> >>>database. In case of parallel live migrations each migration will report 
> >>>progress by itself.
> >>>
> >>>Isn't 2 seconds interval too short for updates if the information 
> >>>is exposed through the API and it requires RPC and DB call to 
> >>>actually save it in the DB?
> >>>
> >>>Our default configuration allows only for 1 concurrent live 
> >>>migration [2], but it might vary between different deployments and 
> >>>use cases as it is configurable. Someone might want to trigger 10 
> >>>(or even more) parallel live migrations and each might take even a 
> >>>day to finish in case of block migration. Also if deployment is big enough 
> >>>rabbitmq might be fully-loaded.
> >>>I'm not sure whether updating each migration every 2 seconds makes 
> >>>sense in this case. On the other hand it might be hard to observe 
> >>>fast enough that migration is stuck if we increase this interval...
> >>Do we have any actual data that this is a real problem. I have a 
> >>pretty hard time believing that a database update of a single field 
> >>every 2 seconds is going to be what pushes Nova over the edge into a 
> >>performance collapse, even if there are 20 migrations running in 
> >>parallel, when you compare it to the amount of DB queries & updates 
> >>done across other areas of the code for pretty much every single API call 
> >>and background job.
> >Also note that progress is rounded to the nearest integer. So even if 
> >the migration runs all day, there is a maximum of 100 possible 
> >changes in value for the progress field, so most of the updates 
> >should turn in to no-ops at the database level.
> >
> >Regards,
> >Daniel
> I agree with Daniel, these rpc and db access ops are a tiny percentage 
> of the overall load on rabbit and mysql and properly configured these 
> subsystems should have no issues with this workload.
> 
> One correction, unless I'm misreading it, the existing 
> _live_migration_monitor code updates the progress field of the 
> instance record every 5 seconds.  However this value can go up and 
> down so an infinite number of updates are possible?

Oh yes, you are in fact correct. Technically you could have an unbounded number 
of updates if migration goes backwards. Some mitigation against this is if we 
see progress going backwards we'll actually abort the migration if it gets 
stuck for too long. We'll also be progressively increasing the permitted 
downtime. So except in pathological scenarios I think the number of updates 
should still be relatively small.

> However, the issue raised here is not with the existing implementation 
> but with the proposed change 
> https://review.openstack.org/#/c/258813/5/nova/virt/libvirt/driver.py
> This add a save() operation on the migration object every 2 seconds

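Ok, that is more heavy weight since it is recording the raw byte values and so 
it is guaranteed to do a database update pretty much every time. It still 
shouldn't be too unreasonable a loading though. FWIW I think it is worth being 
consistent in the update frequency between the progress value & the migration 
object save, so switching to be every 5 seconds probably makes more sense, so 
we know both objects are reflecting the same point in time.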

Re: [openstack-dev] [nova] Migration progress

2016-02-04 Thread Eli Qiao



On 2016-02-05 12:02, Bhandaru, Malini K wrote:

> I agree with Daniel,  keep the periods consistent 5 - 5 .
>
> Another thought, for such ephemeral/changing data, such as progress, why not
> save the information in the cache (and flush to database at a lower rate), and
> retrieve for display to active listeners/UI from the cache. Once complete or
> aborted, of course flush the cache.

Hi Malini,
It's a good idea to use a cache to hold the information during a 
migration, but the problem is how we can access that cache when using 
the CLI (nova-api).
This information is generated on the nova-compute node, so there has to be 
some way to sync it to nova-conductor (which means the DB).

> Also should we provide a "verbose flag", that is only capture progress
> information when requested? That is when a human user might be issuing the command from
> the cli or GUI tool.

I am +1 on this; yes, some other service may help.

--
Best Regards, Eli(Li Yong)Qiao
Intel OTC China



Re: [openstack-dev] [nova] Migration progress

2016-02-03 Thread Daniel P. Berrange
On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> > Hello everyone,
> > 
> > On the yesterday's live migration meeting we had concerns that interval of
> > writing migration progress to the database is too short.
> > 
> > Information about migration progress will be stored in the database and
> > exposed through the API (/servers//migrations/). In current
> > proposition [1] migration progress will be updated every 2 seconds. It
> > basically means that every 2 seconds a call through RPC will go from compute
> > to conductor to write migration data to the database. In case of parallel
> > live migrations each migration will report progress by itself.
> > 
> > Isn't 2 seconds interval too short for updates if the information is exposed
> > through the API and it requires RPC and DB call to actually save it in the
> > DB?
> > 
> > Our default configuration allows only for 1 concurrent live migration [2],
> > but it might vary between different deployments and use cases as it is
> > configurable. Someone might want to trigger 10 (or even more) parallel live
> > migrations and each might take even a day to finish in case of block
> > migration. Also if deployment is big enough rabbitmq might be fully-loaded.
> > I'm not sure whether updating each migration every 2 seconds makes sense in
> > this case. On the other hand it might be hard to observe fast enough that
> > migration is stuck if we increase this interval...
> 
> Do we have any actual data that this is a real problem. I have a pretty hard
> time believing that a database update of a single field every 2 seconds is
> going to be what pushes Nova over the edge into a performance collapse, even
> if there are 20 migrations running in parallel, when you compare it to the
> amount of DB queries & updates done across other areas of the code for pretty
> much every single API call and background job.

Also note that progress is rounded to the nearest integer. So even if the
migration runs all day, there is a maximum of 100 possible changes in value
for the progress field, so most of the updates should turn into no-ops at
the database level.
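
A minimal sketch of why rounding turns most polls into no-ops. It assumes
progress only ever increases, and that a write is skipped whenever the
rounded value is unchanged; the class name and the write counter are
hypothetical stand-ins for the real object save.

```python
class ProgressField:
    """Tracks a rounded percentage and counts how often it really changes."""

    def __init__(self):
        self.value = None
        self.db_writes = 0

    def update(self, fraction_done):
        percent = round(fraction_done * 100)
        if percent == self.value:
            return False  # unchanged: nothing to send to the database
        self.value = percent
        self.db_writes += 1
        return True


polls = 24 * 60 * 60 // 2  # one poll every 2 seconds for a whole day
field = ProgressField()
for i in range(polls):
    field.update(i / (polls - 1))  # steadily increasing progress (assumed)
print(polls, field.db_writes)
```

Under that assumption, 43200 polls produce 101 writes (the values 0 through
100, counting the initial one). If progress moves backwards the count can be
higher.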

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [nova] Migration progress

2016-02-03 Thread Daniel P. Berrange
On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> Hello everyone,
> 
> On the yesterday's live migration meeting we had concerns that interval of
> writing migration progress to the database is too short.
> 
> Information about migration progress will be stored in the database and
> exposed through the API (/servers//migrations/). In current
> proposition [1] migration progress will be updated every 2 seconds. It
> basically means that every 2 seconds a call through RPC will go from compute
> to conductor to write migration data to the database. In case of parallel
> live migrations each migration will report progress by itself.
> 
> Isn't 2 seconds interval too short for updates if the information is exposed
> through the API and it requires RPC and DB call to actually save it in the
> DB?
> 
> Our default configuration allows only for 1 concurrent live migration [2],
> but it might vary between different deployments and use cases as it is
> configurable. Someone might want to trigger 10 (or even more) parallel live
> migrations and each might take even a day to finish in case of block
> migration. Also if deployment is big enough rabbitmq might be fully-loaded.
> I'm not sure whether updating each migration every 2 seconds makes sense in
> this case. On the other hand it might be hard to observe fast enough that
> migration is stuck if we increase this interval...

Do we have any actual data that this is a real problem? I have a pretty hard
time believing that a database update of a single field every 2 seconds is
going to be what pushes Nova over the edge into a performance collapse, even
if there are 20 migrations running in parallel, when you compare it to the
amount of DB queries & updates done across other areas of the code for pretty
much every single API call and background job.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [nova] Migration progress

2016-02-03 Thread Murray, Paul (HP Cloud)


> -----Original Message-----
> From: Daniel P. Berrange [mailto:berra...@redhat.com]
> Sent: 03 February 2016 10:49
> To: OpenStack Development Mailing List (not for usage questions)
> Cc: Feng, Shaohe
> Subject: Re: [openstack-dev] [nova] Migration progress
> 
> On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> > On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> > > Hello everyone,
> > >
> > > On the yesterday's live migration meeting we had concerns that
> > > interval of writing migration progress to the database is too short.
> > >
> > > Information about migration progress will be stored in the database
> > > and exposed through the API (/servers//migrations/). In
> > > current proposition [1] migration progress will be updated every 2
> > > seconds. It basically means that every 2 seconds a call through RPC
> > > will go from compute to conductor to write migration data to the
> > > database. In case of parallel live migrations each migration will report
> progress by itself.
> > >
> > > Isn't 2 seconds interval too short for updates if the information is
> > > exposed through the API and it requires RPC and DB call to actually
> > > save it in the DB?
> > >
> > > Our default configuration allows only for 1 concurrent live
> > > migration [2], but it might vary between different deployments and
> > > use cases as it is configurable. Someone might want to trigger 10
> > > (or even more) parallel live migrations and each might take even a
> > > day to finish in case of block migration. Also if deployment is big enough
> rabbitmq might be fully-loaded.
> > > I'm not sure whether updating each migration every 2 seconds makes
> > > sense in this case. On the other hand it might be hard to observe
> > > fast enough that migration is stuck if we increase this interval...
> >
> > Do we have any actual data that this is a real problem. I have a
> > pretty hard time believing that a database update of a single field
> > every 2 seconds is going to be what pushes Nova over the edge into a
> > performance collapse, even if there are 20 migrations running in
> > parallel, when you compare it to the amount of DB queries & updates
> > done across other areas of the code for pretty much every single API call
> and background job.

As a data point: when we were doing live migrations in HP public cloud for 
rolling updates we were maintaining approximately 150 concurrent migrations 
through the process. At 2s intervals that would make approx. 75 updates per 
second. We don't feel that would have been a problem.
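
The arithmetic behind that figure, as a small sketch. The helper name is
made up; the 150 migrations and the 2-second interval come from the
paragraph above, and the 5-second interval is the alternative discussed
elsewhere in the thread.

```python
def updates_per_second(concurrent_migrations, interval_s):
    """Average update rate when every migration reports once per interval."""
    return concurrent_migrations / interval_s


for interval_s in (2, 5):
    print(interval_s, updates_per_second(150, interval_s))
```

That gives 75 updates per second at 2 seconds and 30 at 5 seconds.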

We also spoke to Michael Still and he thought it wouldn't be a problem for 
Rackspace (remembering they have cells). Having said that, I have no idea of the 
numbers in their case and would rather they spoke for themselves in this thread.



> 
> Also note that progress is rounded to the nearest integer. So even if the
> migration runs all day, there is a maximum of 100 possible changes in value
> for the progress field, so most of the updates should turn in to no-ops at the
> database level.
> 
> Regards,
> Daniel
> --
> |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org  -o- http://virt-manager.org :|
> |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
> 


Re: [openstack-dev] [nova] Migration progress

2016-02-03 Thread Paul Carlton

On 03/02/16 10:49, Daniel P. Berrange wrote:

> On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
>> On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
>>> Hello everyone,
>>>
>>> On the yesterday's live migration meeting we had concerns that interval of
>>> writing migration progress to the database is too short.
>>>
>>> Information about migration progress will be stored in the database and
>>> exposed through the API (/servers//migrations/). In current
>>> proposition [1] migration progress will be updated every 2 seconds. It
>>> basically means that every 2 seconds a call through RPC will go from compute
>>> to conductor to write migration data to the database. In case of parallel
>>> live migrations each migration will report progress by itself.
>>>
>>> Isn't 2 seconds interval too short for updates if the information is exposed
>>> through the API and it requires RPC and DB call to actually save it in the
>>> DB?
>>>
>>> Our default configuration allows only for 1 concurrent live migration [2],
>>> but it might vary between different deployments and use cases as it is
>>> configurable. Someone might want to trigger 10 (or even more) parallel live
>>> migrations and each might take even a day to finish in case of block
>>> migration. Also if deployment is big enough rabbitmq might be fully-loaded.
>>> I'm not sure whether updating each migration every 2 seconds makes sense in
>>> this case. On the other hand it might be hard to observe fast enough that
>>> migration is stuck if we increase this interval...
>>
>> Do we have any actual data that this is a real problem. I have a pretty hard
>> time believing that a database update of a single field every 2 seconds is
>> going to be what pushes Nova over the edge into a performance collapse, even
>> if there are 20 migrations running in parallel, when you compare it to the
>> amount of DB queries & updates done across other areas of the code for pretty
>> much every single API call and background job.
>
> Also note that progress is rounded to the nearest integer. So even if the
> migration runs all day, there is a maximum of 100 possible changes in value
> for the progress field, so most of the updates should turn in to no-ops at
> the database level.
>
> Regards,
> Daniel

I agree with Daniel, these rpc and db access ops are a tiny percentage
of the overall load on rabbit and mysql and properly configured these
subsystems should have no issues with this workload.

One correction, unless I'm misreading it: the existing
_live_migration_monitor code updates the progress field of the instance
record every 5 seconds.  However, this value can go up and down, so
an infinite number of updates is possible?

However, the issue raised here is not with the existing implementation
but with the proposed change
https://review.openstack.org/#/c/258813/5/nova/virt/libvirt/driver.py
This adds a save() operation on the migration object every 2 seconds.

Paul Carlton
Software Engineer
Cloud Services
Hewlett Packard Enterprise
BUK03:T242
Longdown Avenue
Stoke Gifford
Bristol BS34 8QZ

Mobile:+44 (0)7768 994283
Office:+44 (0)117 316 2189
Email:mailto:paul.carlt...@hpe.com
irc:  paul-carlton2



Re: [openstack-dev] [nova] Migration progress

2016-02-03 Thread Daniel P. Berrange
On Wed, Feb 03, 2016 at 11:27:16AM +, Paul Carlton wrote:
> On 03/02/16 10:49, Daniel P. Berrange wrote:
> >On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> >>On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> >>>Hello everyone,
> >>>
> >>>On the yesterday's live migration meeting we had concerns that interval of
> >>>writing migration progress to the database is too short.
> >>>
> >>>Information about migration progress will be stored in the database and
> >>>exposed through the API (/servers//migrations/). In current
> >>>proposition [1] migration progress will be updated every 2 seconds. It
> >>>basically means that every 2 seconds a call through RPC will go from 
> >>>compute
> >>>to conductor to write migration data to the database. In case of parallel
> >>>live migrations each migration will report progress by itself.
> >>>
> >>>Isn't 2 seconds interval too short for updates if the information is 
> >>>exposed
> >>>through the API and it requires RPC and DB call to actually save it in the
> >>>DB?
> >>>
> >>>Our default configuration allows only for 1 concurrent live migration [2],
> >>>but it might vary between different deployments and use cases as it is
> >>>configurable. Someone might want to trigger 10 (or even more) parallel live
> >>>migrations and each might take even a day to finish in case of block
> >>>migration. Also if deployment is big enough rabbitmq might be fully-loaded.
> >>>I'm not sure whether updating each migration every 2 seconds makes sense in
> >>>this case. On the other hand it might be hard to observe fast enough that
> >>>migration is stuck if we increase this interval...
> >>Do we have any actual data that this is a real problem. I have a pretty hard
> >>time believing that a database update of a single field every 2 seconds is
> >>going to be what pushes Nova over the edge into a performance collapse, even
> >>if there are 20 migrations running in parallel, when you compare it to the
> >>amount of DB queries & updates done across other areas of the code for 
> >>pretty
> >>much every single API call and background job.
> >Also note that progress is rounded to the nearest integer. So even if the
> >migration runs all day, there is a maximum of 100 possible changes in value
> >for the progress field, so most of the updates should turn in to no-ops at
> >the database level.
> >
> >Regards,
> >Daniel
> I agree with Daniel, these rpc and db access ops are a tiny percentage
> of the overall load on rabbit and mysql and properly configured these
> subsystems should have no issues with this workload.
> 
> One correction, unless I'm misreading it, the existing
> _live_migration_monitor code updates the progress field of the instance
> record every 5 seconds.  However this value can go up and down so
> an infinite number of updates are possible?

Oh yes, you are in fact correct. Technically you could have an unbounded
number of updates if migration goes backwards. As some mitigation against
this, if we see progress going backwards we'll actually abort the
migration if it gets stuck for too long. We'll also be progressively
increasing the permitted downtime. So except in pathological scenarios
I think the number of updates should still be relatively small.
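
A minimal sketch of the "abort if stuck for too long" idea, assuming that
progress means reaching a new low in the amount of data remaining. The
class name, the timeout value and the units are hypothetical, and this is
not the code in the libvirt driver.

```python
class StallDetector:
    """Flags a migration that has not reached a new low for too long."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.low_watermark = None
        self.last_progress_at = None

    def observe(self, now, data_remaining):
        """Return True when the migration should be treated as stuck."""
        if self.low_watermark is None or data_remaining < self.low_watermark:
            # Real progress: less data left than ever before.
            self.low_watermark = data_remaining
            self.last_progress_at = now
            return False
        # No new low (progress flat or going backwards): check the clock.
        return now - self.last_progress_at > self.timeout_s


detector = StallDetector(timeout_s=150)
for now, remaining in [(0, 1000), (60, 800), (120, 900), (211, 850)]:
    print(now, remaining, detector.observe(now, remaining))
```

In this run the last sample is flagged: the most recent new low was at 60
seconds, and 211 - 60 exceeds the assumed 150-second timeout.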

> However, the issue raised here is not with the existing implementation
> but with the proposed change
> https://review.openstack.org/#/c/258813/5/nova/virt/libvirt/driver.py
> This add a save() operation on the migration object every 2 seconds

Ok, that is more heavyweight since it is recording the raw byte values
and so it is guaranteed to do a database update pretty much every time.
It still shouldn't be too unreasonable a load though. FWIW I think
it is worth being consistent in the update frequency between the
progress value & the migration object save, so switching to every
5 seconds probably makes more sense, so we know both objects are
reflecting the same point in time.
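
A minimal sketch of saving both records from the same sample, so that they
reflect the same point in time. The function, the callback names, the
dictionary keys and the 1-second poll are hypothetical; only the 5-second
save interval comes from the discussion above.

```python
def monitor(samples, save_instance, save_migration, poll_s=1, save_every_s=5):
    """Walk (remaining, total) samples and persist both records together."""
    for tick, (remaining, total) in enumerate(samples):
        elapsed = tick * poll_s
        if elapsed % save_every_s != 0:
            continue  # poll only; nothing is persisted on this tick
        percent = round(100 * (total - remaining) / total) if total else 0
        # Both writes use the same sample, taken at the same moment.
        save_instance(percent)
        save_migration({"memory_total": total, "memory_remaining": remaining})


instance_saves, migration_saves = [], []
samples = [(1000 - 50 * i, 1000) for i in range(11)]  # 11 one-second polls
monitor(samples, instance_saves.append, migration_saves.append)
print(instance_saves)
print(len(migration_saves))
```

Driving both saves from one interval means a reader never sees a percentage
and raw byte counts that were taken seconds apart.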

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [nova] Migration progress

2015-11-24 Thread 少合冯
Hi Paul,
Comments inline:

2015-11-23 16:36 GMT+08:00 Paul Carlton :

> John
>
> At the live migration sub team meeting I undertook to look at the issue
> of progress reporting.
>
> The use cases I'm envisaging are...
>
> As a user I want to know how much longer my instance will be migrating
> for.
>
> As an operator I want to identify any migrations that are making slow
>  progress so I can expedite their progress or abort them.
>
> The current implementation reports on the instance's migration with
> respect to memory transfer, using the total memory and memory remaining
> fields from libvirt to report the percentage of memory still to be
> transferred.  Due to the instance writing to pages already transferred
> this percentage can go up as well as down.  Daniel has done a good job
> of generating regular log records to report progress and highlight lack
> of progress but from the API all a user/operator can see is the current
> percentage complete.  By observing this periodically they can identify
> instance migrations that are struggling to migrate memory pages fast
> enough to keep pace with the instance's memory updates.
>
> The problem is that at present we have only one field, the instance
> progress, to record progress.  With a live migration there are measures
>

[Shaohe]:

From this link, the OpenStack API reference:
http://developer.openstack.org/api-ref-compute-v2.1.html#listDetailServers
It describes the instance progress as: "A percentage value of the build progress."
But for the libvirt driver it is actually the migration progress.
For other drivers it is the build progress.
There is also a spec that proposes some changes:
https://review.openstack.org/#/c/249086/



> of progress, how much of the ephemeral disks (not needed for shared
> disk setups) have been copied and how much of the memory has been
> copied. Both can go up and down as the instance writes to pages already
> copied causing those pages to need to be copied again.  As Daniel says
> in his comments in the code, the disk size could dwarf the memory so
> reporting both in single percentage number is problematic.
>
> We could add an additional progress item to the instance object, i.e.
> disk progress and memory progress but that seems odd to have an
> additional progress field only for this operation so this is probably
> a non starter!
>
> For operations staff with access to log files we could report disk
> progress as well as memory in the log file, however that does not
> address the needs of users and whilst log files are the right place for
> support staff to look when investigating issues operational tooling
> is much better served by notification messages.
>
> Thus I'd recommend generating periodic notifications during a migration
> to report both memory and disk progress would be useful?  Cloud
> operators are likely to manage their instance migration activity using
> some orchestration tooling which could consume these notifications and
> deduce what challenges the instance migration is encountering and thus
> determine how to address any issues.
>
> The use cases are only partially addressed by the current
> implementation, they can repeatedly get the server details and look at
> the progress percentage to see how quickly (or even if) it is
> increasing and determine how long the instance is likely to be
> migrating for.  However for an instance that has a large disk and/or
> is doing a high rate of disk i/o they may see the percentage complete
> (i.e. memory) repeatedly showing 90%+ but the instance migration does
> not complete.
>
> The nova spec https://review.openstack.org/#/c/248472/ suggests making
> detailed information available via the os-migrations object.  This is
> not a bad idea but I have some issues with the implementation that I
> will share on that spec.
>

[Shaohe]:

Daniel has given some comments on this spec, and we have updated it.
Maybe we can work together on it to make it better.

I have worked on multi-thread compression for live migration in libvirt, and
have looked into some live migration performance optimizations.

That led to the following ideas:
1. Let nova expose more live migration details, such as the RAM statistics,
the xbzrle-cache status, and in future the multi-thread compression
information, and so on.
2. Nova can enable auto-converge, and tune the xbzrle-cache and multi-thread
compression dynamically.
3. Other projects can then build a good strategy for tuning the live migration
based on the migration details.


For example:
Cache size is key to xbzrle performance. Ideally the cache size is the same as
the guest's total RAM, but that much memory may not always be available on the host.
For multi-thread compression, a higher compression level is better, but it
consumes more CPU.
Auto-converge slows down the guest's CPUs.
So things are not always as good as I had expected.

We also submitted a summit topic about this idea, but it was not accepted.
Topic: 
Link: 

Re: [openstack-dev] [nova] Migration progress

2015-11-23 Thread John Garbutt
On 23 November 2015 at 08:36, Paul Carlton  wrote:
> John
>
> At the live migration sub team meeting I undertook to look at the issue
> of progress reporting.
>
> The use cases I'm envisaging are...
>
> As a user I want to know how much longer my instance will be migrating
> for.
>
> As an operator I want to identify any migration that are making slow
>  progress so I can expedite their progress or abort them.

+1

Agreed with this need.

Proposals to add pause and cancel clearly make this need more acute.

> The current implementation reports on the instance's migration with
> respect to memory transfer, using the total memory and memory remaining
> fields from libvirt to report the percentage of memory still to be
> transferred.  Due to the instance writing to pages already transferred
> this percentage can go up as well as down.  Daniel has done a good job
> of generating regular log records to report progress and highlight lack
> of progress but from the API all a user/operator can see is the current
> percentage complete.  By observing this periodically they can identify
> instance migrations that are struggling to migrate memory pages fast
> enough to keep pace with the instance's memory updates.
>
> The problem is that at present we have only one field, the instance
> progress, to record progress.  With a live migration there are measures
> of progress, how much of the ephemeral disks (not needed for shared
> disk setups) have been copied and how much of the memory has been
> copied. Both can go up and down as the instance writes to pages already
> copied causing those pages to need to be copied again.  As Daniel says
> in his comments in the code, the disk size could dwarf the memory so
> reporting both in single percentage number is problematic.
>
> We could add an additional progress item to the instance object, i.e.
> disk progress and memory progress but that seems odd to have an
> additional progress field only for this operation so this is probably
> a non starter!
>
> For operations staff with access to log files we could report disk
> progress as well as memory in the log file, however that does not
> address the needs of users and whilst log files are the right place for
> support staff to look when investigating issues operational tooling
> is much better served by notification messages.
>
> Thus I'd recommend generating periodic notifications during a migration
> to report both memory and disk progress would be useful?  Cloud
> operators are likely to manage their instance migration activity using
> some orchestration tooling which could consume these notifications and
> deduce what challenges the instance migration is encountering and thus
> determine how to address any issues.

To be clear, our notifications are not designed to be consumed by end users.

> The use cases are only partially addressed by the current
> implementation, they can repeatedly get the server details and look at
> the progress percentage to see how quickly (or even if) it is
> increasing and determine how long the instance is likely to be
> migrating for.  However for an instance that has a large disk and/or
> is doing a high rate of disk i/o they may see the percentage complete
> (i.e. memory) repeatedly showing 90%+ but the instance migration does
> not complete.

Agreed reporting progress, particularly with live-migrate, is awful right now.

Long term, I have my eye on this work:
https://etherpad.openstack.org/p/liberty-cross-project-user-notifications

But we should work on getting a good conceptual model for the progress
that can be exposed using the above system.
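
As one possible starting point for such a model, here is a minimal sketch that
keeps memory and disk as separate total/remaining counters. The class and
function names are hypothetical, and the sizes are made up to show the point
quoted above that a large disk can dwarf memory in a single percentage.

```python
from dataclasses import dataclass

GiB = 1024 ** 3


@dataclass(frozen=True)
class TransferProgress:
    """Hypothetical progress record: bytes to copy and bytes still to go."""
    total_bytes: int
    remaining_bytes: int

    @property
    def percent_done(self) -> int:
        if self.total_bytes <= 0:
            return 0
        done = self.total_bytes - self.remaining_bytes
        # Clamp, since remaining can temporarily exceed earlier readings.
        return max(0, min(100, done * 100 // self.total_bytes))


def combined_percent(memory: TransferProgress, disk: TransferProgress) -> int:
    """Fold both counters into one number, weighted by size in bytes."""
    total = memory.total_bytes + disk.total_bytes
    remaining = memory.remaining_bytes + disk.remaining_bytes
    return TransferProgress(total, remaining).percent_done


memory = TransferProgress(total_bytes=4 * GiB, remaining_bytes=1 * GiB)
disk = TransferProgress(total_bytes=100 * GiB, remaining_bytes=90 * GiB)

print(memory.percent_done)             # 75
print(disk.percent_done)               # 10
print(combined_percent(memory, disk))  # 12
```

The combined figure is almost entirely the disk figure, which is why keeping
the two counters separate in the model seems worth considering.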

> The nova spec https://review.openstack.org/#/c/248472/ suggests making
> detailed information available via the os-migrations object.  This is
> not a bad idea but I have some issues with the implementation that I
> will share on that spec.

We do also need something that works across all hypervisor types.

Let's talk more on that spec review.

Thanks,
johnthetubaguy



Re: [openstack-dev] [nova] Migration progress

2015-11-23 Thread Paul Carlton



On 23/11/15 11:02, John Garbutt wrote:

> On 23 November 2015 at 08:36, Paul Carlton wrote:
>> John
>>
>> At the live migration sub team meeting I undertook to look at the issue
>> of progress reporting.
>>
>> The use cases I'm envisaging are...
>>
>> As a user I want to know how much longer my instance will be migrating
>> for.
>>
>> As an operator I want to identify any migration that are making slow
>>  progress so I can expedite their progress or abort them.
>
> +1
>
> Agreed with this need.
>
> Proposals to add pause and cancel clearly make this need more acute.
>
>> The current implementation reports on the instance's migration with
>> respect to memory transfer, using the total memory and memory remaining
>> fields from libvirt to report the percentage of memory still to be
>> transferred.  Due to the instance writing to pages already transferred
>> this percentage can go up as well as down.  Daniel has done a good job
>> of generating regular log records to report progress and highlight lack
>> of progress but from the API all a user/operator can see is the current
>> percentage complete.  By observing this periodically they can identify
>> instance migrations that are struggling to migrate memory pages fast
>> enough to keep pace with the instance's memory updates.
>>
>> The problem is that at present we have only one field, the instance
>> progress, to record progress.  With a live migration there are measures
>> of progress, how much of the ephemeral disks (not needed for shared
>> disk setups) have been copied and how much of the memory has been
>> copied. Both can go up and down as the instance writes to pages already
>> copied causing those pages to need to be copied again.  As Daniel says
>> in his comments in the code, the disk size could dwarf the memory so
>> reporting both in single percentage number is problematic.
>>
>> We could add an additional progress item to the instance object, i.e.
>> disk progress and memory progress but that seems odd to have an
>> additional progress field only for this operation so this is probably
>> a non starter!
>>
>> For operations staff with access to log files we could report disk
>> progress as well as memory in the log file, however that does not
>> address the needs of users and whilst log files are the right place for
>> support staff to look when investigating issues operational tooling
>> is much better served by notification messages.
>>
>> Thus I'd recommend generating periodic notifications during a migration
>> to report both memory and disk progress would be useful?  Cloud
>> operators are likely to manage their instance migration activity using
>> some orchestration tooling which could consume these notifications and
>> deduce what challenges the instance migration is encountering and thus
>> determine how to address any issues.
>
> To be clear, our notifications are not designed to be consumed by end users.

Yep, I see this as something cloud operations tooling could consume.
It does not address end user's needs.

>> The use cases are only partially addressed by the current
>> implementation, they can repeatedly get the server details and look at
>> the progress percentage to see how quickly (or even if) it is
>> increasing and determine how long the instance is likely to be
>> migrating for.  However for an instance that has a large disk and/or
>> is doing a high rate of disk i/o they may see the percentage complete
>> (i.e. memory) repeatedly showing 90%+ but the instance migration does
>> not complete.
>
> Agreed reporting progress, particularly with live-migrate, is awful right now.
>
> Long term, I have my eye on this work:
> https://etherpad.openstack.org/p/liberty-cross-project-user-notifications
>
> But we should work on getting a good conceptual model for the progress
> that can be exposed using the above system.
>
>> The nova spec https://review.openstack.org/#/c/248472/ suggests making
>> detailed information available via the os-migrations object.  This is
>> not a bad idea but I have some issues with the implementation that I
>> will share on that spec.
>
> We do also need something that works across all hypervisor types.
>
> Lets talk more on that spec review.
>
> Thanks,
> johnthetubaguy


--
Paul Carlton
Software Engineer
Cloud Services
Hewlett Packard
BUK03:T242
Longdown Avenue
Stoke Gifford
Bristol BS34 8QZ

Mobile:+44 (0)7768 994283
Email:mailto:paul.carlt...@hpe.com





Re: [openstack-dev] [nova] Migration progress

2015-11-23 Thread Daniel P. Berrange
On Mon, Nov 23, 2015 at 08:36:32AM +, Paul Carlton wrote:
> John
> 
> At the live migration sub team meeting I undertook to look at the issue
> of progress reporting.
> 
> The use cases I'm envisaging are...
> 
> As a user I want to know how much longer my instance will be migrating
> for.
> 
> As an operator I want to identify any migration that are making slow
>  progress so I can expedite their progress or abort them.
> 
> The current implementation reports on the instance's migration with
> respect to memory transfer, using the total memory and memory remaining
> fields from libvirt to report the percentage of memory still to be
> transferred.  Due to the instance writing to pages already transferred
> this percentage can go up as well as down.  Daniel has done a good job
> of generating regular log records to report progress and highlight lack
> of progress but from the API all a user/operator can see is the current
> percentage complete.  By observing this periodically they can identify
> instance migrations that are struggling to migrate memory pages fast
> enough to keep pace with the instance's memory updates.
> 
> The problem is that at present we have only one field, the instance
> progress, to record progress.  With a live migration there are measures
> of progress, how much of the ephemeral disks (not needed for shared
> disk setups) have been copied and how much of the memory has been
> copied. Both can go up and down as the instance writes to pages already
> copied causing those pages to need to be copied again.  As Daniel says
> in his comments in the code, the disk size could dwarf the memory so
> reporting both in single percentage number is problematic.
> 
> We could add an additional progress item to the instance object, i.e.
> disk progress and memory progress but that seems odd to have an
> additional progress field only for this operation so this is probably
> a non starter!
> 
> For operations staff with access to log files we could report disk
> progress as well as memory in the log file, however that does not
> address the needs of users and whilst log files are the right place for
> support staff to look when investigating issues operational tooling
> is much better served by notification messages.
> 
> Thus I'd recommend generating periodic notifications during a migration
> to report both memory and disk progress would be useful?  Cloud
> operators are likely to manage their instance migration activity using
> some orchestration tooling which could consume these notifications and
> deduce what challenges the instance migration is encountering and thus
> determine how to address any issues.
> 
> The use cases are only partially addressed by the current
> implementation, they can repeatedly get the server details and look at
> the progress percentage to see how quickly (or even if) it is
> increasing and determine how long the instance is likely to be
> migrating for.  However for an instance that has a large disk and/or
> is doing a high rate of disk i/o they may see the percentage complete
> (i.e. memory) repeatedly showing 90%+ but the instance migration does
> not complete.
> 
> The nova spec https://review.openstack.org/#/c/248472/ suggests making
> detailed information available via the os-migrations object.  This is
> not a bad idea but I have some issues with the implementation that I
> will share on that spec.

As I mentioned in the spec, I won't support exposing anything other
than disk total + remaining via the API. All the other stats are
low level QEMU specific implementation details that I feel the public
API users have no business knowing about.

In general I think we need to be wary of exposing lots of info + knobs
via the API, as that direction essentially ends up forcing the problem
onto client application. The focus should really be on ensuring that
Nova consumes all these stats exposed by QEMU and makes decisions
itself based on that.

At most an external application should have information on the data
transfer progress. I'm not even convinced that applications should
need to be able to figure out if a live migration is stuck. I generally
think that any scenario in which a live migration can get stuck is a
bug in Nova's management of the migration process. IOW, the focus of
our efforts should be on ensuring Nova does the right thing to guarantee
that live migration will never get stuck. At which point a Nova client
user / application should really only care about the overall progress
of a live migration.
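
A minimal sketch of that separation, assuming the driver hands Nova a plain
dict of stats: Nova keeps everything for its own decisions, and only a small
allowlist of transfer totals is passed on to the API layer. The key names and
the exact choice of public fields here are illustrative, not an agreed API.

```python
# Hypothetical allowlist: only coarse data transfer totals leave Nova.
PUBLIC_FIELDS = (
    "memory_total",
    "memory_remaining",
    "disk_total",
    "disk_remaining",
)


def public_view(raw_stats):
    """Return only the fields considered safe to show to API users."""
    return {key: raw_stats[key] for key in PUBLIC_FIELDS if key in raw_stats}


raw = {
    "memory_total": 4096,
    "memory_remaining": 512,
    "disk_total": 102400,
    "disk_remaining": 81920,
    "xbzrle_cache_size": 64,    # hypervisor-specific detail, stays internal
    "dirty_rate": 5,            # hypervisor-specific detail, stays internal
}

print(public_view(raw))
```

Anything hypervisor-specific stays on the Nova side of the boundary, so the
public view does not need to change when the driver gains new stats.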

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|

__
OpenStack Development Mailing List (not for usage