Re: [openstack-dev] [nova] Migration progress
Hi,

For live-migration reporting I'd rather go with real-time stats, queried
from the compute node, instead of writing this data to the DB first.
While the number of RPC requests/DB updates per migration is relatively
small, the total number of such requests grows with the number of active
migrations. Querying real-time data from compute reduces it: not every
migration will prompt an operator to gather statistics, and each query
that is made needs only 2 RPCs per request, instead of 2 RPCs and a DB
write every 3/5/etc. seconds.

Timofey.

On Sun, Feb 7, 2016 at 10:31 PM, Jay Pipes wrote:
> On 02/04/2016 11:02 PM, Bhandaru, Malini K wrote:
>
>> Another thought, for such ephemeral/changing data, such as progress,
>> why not save the information in the cache (and flush to database at a
>> lower rate), and retrieve for display to active listeners/UI from the
>> cache. Once complete or aborted, of course flush the cache.
>>
>> Also should we provide a "verbose flag", that is only capture
>> progress information when requested? That is when a human user might
>> be issuing the command from the cli or GUI tool.
>
> I agree with you, Malini, on the above suggestion that there is some doubt
> as to the value of saving this temporal data to the database.
>
> Why not just have an on-demand model that simply routes the request for
> progress information directly to the compute node and sends the progress
> amount back directly to the nova-api service instead of going to the
> database at all?
>
> Another alternative would be to use a push model instead of a poll model,
> but that would require a pretty significant change to the code...
>
> Best,
> -jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
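To make the comparison above concrete, here is a rough back-of-the-envelope sketch. It only uses the accounting from the message (2 RPCs per update or query, one DB write per periodic update); the function names and the example numbers (10 migrations, one hour, 5 s interval, 3 operator queries) are assumptions chosen for illustration, not measurements.

```python
import math


def periodic_cost(migrations, duration_s, interval_s, rpcs_per_update=2):
    """Messages and DB writes when every migration reports on a timer."""
    ticks = math.floor(duration_s / interval_s)  # updates per migration
    updates = migrations * ticks
    return {"rpcs": updates * rpcs_per_update, "db_writes": updates}


def on_demand_cost(queries, rpcs_per_query=2):
    """Messages when progress is fetched from compute only when asked for."""
    return {"rpcs": queries * rpcs_per_query, "db_writes": 0}


# Assumed scenario: 10 one-hour migrations reporting every 5 seconds,
# versus an operator asking for progress three times in total.
periodic = periodic_cost(migrations=10, duration_s=3600, interval_s=5)
on_demand = on_demand_cost(queries=3)
print(periodic)
print(on_demand)
```

The periodic cost scales with the number and duration of migrations, while the on-demand cost scales only with how often somebody actually asks.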
Re: [openstack-dev] [nova] Migration progress
On 02/04/2016 11:02 PM, Bhandaru, Malini K wrote:
> Another thought, for such ephemeral/changing data, such as progress,
> why not save the information in the cache (and flush to database at a
> lower rate), and retrieve for display to active listeners/UI from the
> cache. Once complete or aborted, of course flush the cache.
>
> Also should we provide a "verbose flag", that is only capture
> progress information when requested? That is when a human user might
> be issuing the command from the cli or GUI tool.

I agree with you, Malini, on the above suggestion that there is some doubt
as to the value of saving this temporal data to the database.

Why not just have an on-demand model that simply routes the request for
progress information directly to the compute node and sends the progress
amount back directly to the nova-api service instead of going to the
database at all?

Another alternative would be to use a push model instead of a poll model,
but that would require a pretty significant change to the code...

Best,
-jay
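The on-demand model described above could look roughly like the following. This is a minimal sketch, not nova code: `ComputeNode`, `ApiService`, `get_migration_progress` and `show_progress` are hypothetical names, and the RPC hop is modelled as a plain method call.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    """Hypothetical stand-in for a nova-compute host."""
    # migration id -> (bytes remaining, bytes total), as a hypervisor might report
    _stats: dict = field(default_factory=dict)

    def set_stats(self, migration_id, remaining, total):
        self._stats[migration_id] = (remaining, total)

    def get_migration_progress(self, migration_id):
        """Compute progress at the moment it is asked for; nothing is stored."""
        remaining, total = self._stats[migration_id]
        return round(100 * (total - remaining) / total)


@dataclass
class ApiService:
    """Hypothetical stand-in for nova-api; it holds no progress state itself."""
    computes: dict  # host name -> ComputeNode
    db_writes: int = 0  # stays at zero: this path never touches the database

    def show_progress(self, host, migration_id):
        # One request to the compute host and one reply.
        return self.computes[host].get_migration_progress(migration_id)


compute = ComputeNode()
compute.set_stats("mig-1", remaining=256, total=1024)
api = ApiService(computes={"host-a": compute})
progress = api.show_progress("host-a", "mig-1")
print(progress)
```

The trade-off is that the answer is only available while the compute host is reachable, and nothing is recorded for later inspection unless something else persists it.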
Re: [openstack-dev] [nova] Migration progress
I think we can add a config option for this and set a theoretically sound
default value. We can also add help text telling the user how an
inappropriate value for this option will affect performance.

On Wed, Feb 3, 2016 at 7:45 PM, Daniel P. Berrange wrote:
> On Wed, Feb 03, 2016 at 11:27:16AM +, Paul Carlton wrote:
> > On 03/02/16 10:49, Daniel P. Berrange wrote:
> > > On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> > > > On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> > > > > Hello everyone,
> > > > >
> > > > > On the yesterday's live migration meeting we had concerns that
> > > > > interval of writing migration progress to the database is too short.
> > > > >
> > > > > Information about migration progress will be stored in the database
> > > > > and exposed through the API (/servers//migrations/). In current
> > > > > proposition [1] migration progress will be updated every 2 seconds.
> > > > > It basically means that every 2 seconds a call through RPC will go
> > > > > from compute to conductor to write migration data to the database.
> > > > > In case of parallel live migrations each migration will report
> > > > > progress by itself.
> > > > >
> > > > > Isn't 2 seconds interval too short for updates if the information
> > > > > is exposed through the API and it requires RPC and DB call to
> > > > > actually save it in the DB?
> > > > >
> > > > > Our default configuration allows only for 1 concurrent live
> > > > > migration [2], but it might vary between different deployments and
> > > > > use cases as it is configurable. Someone might want to trigger 10
> > > > > (or even more) parallel live migrations and each might take even a
> > > > > day to finish in case of block migration. Also if deployment is big
> > > > > enough rabbitmq might be fully-loaded. I'm not sure whether updating
> > > > > each migration every 2 seconds makes sense in this case. On the
> > > > > other hand it might be hard to observe fast enough that migration
> > > > > is stuck if we increase this interval...
> > > >
> > > > Do we have any actual data that this is a real problem. I have a
> > > > pretty hard time believing that a database update of a single field
> > > > every 2 seconds is going to be what pushes Nova over the edge into a
> > > > performance collapse, even if there are 20 migrations running in
> > > > parallel, when you compare it to the amount of DB queries & updates
> > > > done across other areas of the code for pretty much every single API
> > > > call and background job.
> > >
> > > Also note that progress is rounded to the nearest integer. So even if
> > > the migration runs all day, there is a maximum of 100 possible
> > > changes in value for the progress field, so most of the updates
> > > should turn in to no-ops at the database level.
> > >
> > > Regards,
> > > Daniel
> >
> > I agree with Daniel, these rpc and db access ops are a tiny percentage
> > of the overall load on rabbit and mysql and properly configured these
> > subsystems should have no issues with this workload.
> >
> > One correction, unless I'm misreading it, the existing
> > _live_migration_monitor code updates the progress field of the instance
> > record every 5 seconds. However this value can go up and down so
> > an infinite number of updates are possible?
>
> Oh yes, you are in fact correct. Technically you could have an unbounded
> number of updates if migration goes backwards. Some mitigation against
> this is if we see progress going backwards we'll actually abort the
> migration if it gets stuck for too long. We'll also be progressively
> increasing the permitted downtime. So except in pathological scenarios
> I think the number of updates should still be relatively small.
>
> > However, the issue raised here is not with the existing implementation
> > but with the proposed change
> > https://review.openstack.org/#/c/258813/5/nova/virt/libvirt/driver.py
> > This add a save() operation on the migration object every 2 seconds
>
> Ok, that is more heavy weight since it is recording the raw byte values
> and so it is guaranteed to do a database update pretty much every time.
> It still shouldn't be too unreasonable a loading though. FWIW I think
> it is worth being consistent in the update frequency between the
> progress value & the migration object save, so switching to be every
> 5 seconds probably makes more sense, so we know both objects are
> reflecting the same point in time.
>
> Regards,
> Daniel
> --
> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org -o- http://virt-manager.org :|
> |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [openstack-dev] [nova] Migration progress
I agree with Daniel, keep the periods consistent: 5 - 5.

Another thought: for such ephemeral/changing data, such as progress, why
not save the information in the cache (and flush to the database at a
lower rate), and retrieve it from the cache for display to active
listeners/UI? Once the migration completes or is aborted, of course
flush the cache.

Also, should we provide a "verbose flag", that is, only capture progress
information when requested? That is, when a human user might be issuing
the command from the CLI or a GUI tool.

Regards
Malini

-----Original Message-----
From: Daniel P. Berrange [mailto:berra...@redhat.com]
Sent: Wednesday, February 03, 2016 11:46 AM
To: Paul Carlton <paul.carlt...@hpe.com>
Cc: Feng, Shaohe <shaohe.f...@intel.com>; OpenStack Development Mailing List (not for usage questions) <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [nova] Migration progress

On Wed, Feb 03, 2016 at 11:27:16AM +, Paul Carlton wrote:
> On 03/02/16 10:49, Daniel P. Berrange wrote:
> > On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> > > On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> > > > Hello everyone,
> > > >
> > > > On the yesterday's live migration meeting we had concerns that
> > > > interval of writing migration progress to the database is too short.
> > > >
> > > > Information about migration progress will be stored in the database
> > > > and exposed through the API (/servers//migrations/). In current
> > > > proposition [1] migration progress will be updated every 2 seconds.
> > > > It basically means that every 2 seconds a call through RPC will go
> > > > from compute to conductor to write migration data to the database.
> > > > In case of parallel live migrations each migration will report
> > > > progress by itself.
> > > >
> > > > Isn't 2 seconds interval too short for updates if the information
> > > > is exposed through the API and it requires RPC and DB call to
> > > > actually save it in the DB?
> > > >
> > > > Our default configuration allows only for 1 concurrent live
> > > > migration [2], but it might vary between different deployments and
> > > > use cases as it is configurable. Someone might want to trigger 10
> > > > (or even more) parallel live migrations and each might take even a
> > > > day to finish in case of block migration. Also if deployment is big
> > > > enough rabbitmq might be fully-loaded. I'm not sure whether updating
> > > > each migration every 2 seconds makes sense in this case. On the
> > > > other hand it might be hard to observe fast enough that migration
> > > > is stuck if we increase this interval...
> > >
> > > Do we have any actual data that this is a real problem. I have a
> > > pretty hard time believing that a database update of a single field
> > > every 2 seconds is going to be what pushes Nova over the edge into a
> > > performance collapse, even if there are 20 migrations running in
> > > parallel, when you compare it to the amount of DB queries & updates
> > > done across other areas of the code for pretty much every single API
> > > call and background job.
> >
> > Also note that progress is rounded to the nearest integer. So even if
> > the migration runs all day, there is a maximum of 100 possible
> > changes in value for the progress field, so most of the updates
> > should turn in to no-ops at the database level.
> >
> > Regards,
> > Daniel
>
> I agree with Daniel, these rpc and db access ops are a tiny percentage
> of the overall load on rabbit and mysql and properly configured these
> subsystems should have no issues with this workload.
>
> One correction, unless I'm misreading it, the existing
> _live_migration_monitor code updates the progress field of the
> instance record every 5 seconds. However this value can go up and
> down so an infinite number of updates are possible?

Oh yes, you are in fact correct. Technically you could have an unbounded
number of updates if migration goes backwards. Some mitigation against
this is if we see progress going backwards we'll actually abort the
migration if it gets stuck for too long. We'll also be progressively
increasing the permitted downtime. So except in pathological scenarios
I think the number of updates should still be relatively small.

> However, the issue raised here is not with the existing implementation
> but with the proposed change
> https://review.openstack.org/#/c/258813/5/nova/virt/libvirt/driver.py
> This add a save() operation on the migration object every 2 seconds

Ok, that is more heavy weight sinc
Re: [openstack-dev] [nova] Migration progress
On 2016-02-05 12:02, Bhandaru, Malini K wrote:
> I agree with Daniel, keep the periods consistent 5 - 5 .
>
> Another thought, for such ephemeral/changing data, such as progress,
> why not save the information in the cache (and flush to database at a
> lower rate), and retrieve for display to active listeners/UI from the
> cache. Once complete or aborted, of course flush the cache.

Hi Malini,

It's a good idea to use a cache to hold the information during a
migration, but the problem is: how can we access that cache from the CLI
(via nova-api)? This information is generated on the nova-compute node,
so there has to be some way to sync it to nova-conductor (which means
the DB).

> Also should we provide a "verbose flag", that is only capture
> progress information when requested? That is when a human user might
> be issuing the command from the cli or GUI tool.

I am +1 on this. Yeah, some other services may help.

--
Best Regards,
Eli (Li Yong) Qiao
Intel OTC China
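The "cache now, flush to the database at a lower rate" idea discussed above can be sketched as below. This is an illustration under stated assumptions, not nova code: `ProgressCache`, its methods and the 30-second flush interval are all hypothetical, and `save` stands in for whatever slow path (RPC plus DB write) would persist the value. It does not answer the question raised in the reply, namely how nova-api would reach a cache that lives on the compute node.

```python
import time


class ProgressCache:
    """Keep the latest progress in memory; persist at most once per interval."""

    def __init__(self, save, flush_interval=30.0, clock=time.monotonic):
        self._save = save                  # callable(migration_id, value): slow path
        self._flush_interval = flush_interval
        self._clock = clock
        self._latest = {}                  # migration_id -> most recent value
        self._last_flush = {}              # migration_id -> time of last persisted write

    def update(self, migration_id, value):
        """Record a new value; write through only if the interval has elapsed."""
        self._latest[migration_id] = value
        now = self._clock()
        last = self._last_flush.get(migration_id)
        if last is None or now - last >= self._flush_interval:
            self._flush(migration_id, now)

    def get(self, migration_id):
        """Serve readers from memory."""
        return self._latest.get(migration_id)

    def finish(self, migration_id):
        """On completion or abort: persist the final value and drop the entry."""
        self._flush(migration_id, self._clock())
        del self._latest[migration_id]
        del self._last_flush[migration_id]

    def _flush(self, migration_id, now):
        self._save(migration_id, self._latest[migration_id])
        self._last_flush[migration_id] = now


# Usage with a fake clock: ten updates, 2 seconds apart, then completion.
writes = []
fake_now = [0.0]
cache = ProgressCache(lambda m, v: writes.append((m, v)),
                      flush_interval=30.0, clock=lambda: fake_now[0])
for step, value in enumerate(range(10, 20)):
    fake_now[0] = step * 2.0
    cache.update("mig-1", value)
latest_seen = cache.get("mig-1")
cache.finish("mig-1")
print(writes)
```

In this run only the first and the final value reach the slow path, while readers of the cache always see the most recent value.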
Re: [openstack-dev] [nova] Migration progress
On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> > Hello everyone,
> >
> > On the yesterday's live migration meeting we had concerns that
> > interval of writing migration progress to the database is too short.
> >
> > Information about migration progress will be stored in the database
> > and exposed through the API (/servers//migrations/). In current
> > proposition [1] migration progress will be updated every 2 seconds.
> > It basically means that every 2 seconds a call through RPC will go
> > from compute to conductor to write migration data to the database.
> > In case of parallel live migrations each migration will report
> > progress by itself.
> >
> > Isn't 2 seconds interval too short for updates if the information
> > is exposed through the API and it requires RPC and DB call to
> > actually save it in the DB?
> >
> > Our default configuration allows only for 1 concurrent live
> > migration [2], but it might vary between different deployments and
> > use cases as it is configurable. Someone might want to trigger 10
> > (or even more) parallel live migrations and each might take even a
> > day to finish in case of block migration. Also if deployment is big
> > enough rabbitmq might be fully-loaded. I'm not sure whether updating
> > each migration every 2 seconds makes sense in this case. On the
> > other hand it might be hard to observe fast enough that migration
> > is stuck if we increase this interval...
>
> Do we have any actual data that this is a real problem. I have a
> pretty hard time believing that a database update of a single field
> every 2 seconds is going to be what pushes Nova over the edge into a
> performance collapse, even if there are 20 migrations running in
> parallel, when you compare it to the amount of DB queries & updates
> done across other areas of the code for pretty much every single API
> call and background job.

Also note that progress is rounded to the nearest integer. So even if
the migration runs all day, there is a maximum of 100 possible changes
in value for the progress field, so most of the updates should turn
into no-ops at the database level.

Regards,
Daniel
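The rounding argument above can be illustrated with a small sketch. Note the assumption: here the "no-op" is made explicit by comparing against the last saved value before calling `save`, whereas the message only says the updates end up as no-ops at the database level. `ProgressReporter` and `report` are hypothetical names, not nova code.

```python
class ProgressReporter:
    """Persist an integer percentage only when it differs from the last one saved."""

    def __init__(self, save):
        self._save = save          # callable(progress): stands in for the DB update
        self._last_saved = None
        self.writes = 0

    def report(self, remaining, total):
        # Round to a whole percentage, so many consecutive samples collapse
        # onto the same value.
        progress = round(100 * (total - remaining) / total) if total else 0
        if progress != self._last_saved:
            self._save(progress)
            self._last_saved = progress
            self.writes += 1
        return progress


# 1001 samples of a migration that only moves forwards.
saved = []
reporter = ProgressReporter(saved.append)
total = 1000
for copied in range(total + 1):
    reporter.report(remaining=total - copied, total=total)
print(reporter.writes)
```

With strictly forward progress the number of writes is bounded by the number of distinct integer percentages, however many samples are taken. If progress moves backwards, as raised elsewhere in the thread, each change of value triggers another write, so the bound no longer holds.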
Re: [openstack-dev] [nova] Migration progress
On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> Hello everyone,
>
> On the yesterday's live migration meeting we had concerns that
> interval of writing migration progress to the database is too short.
>
> Information about migration progress will be stored in the database
> and exposed through the API (/servers//migrations/). In current
> proposition [1] migration progress will be updated every 2 seconds.
> It basically means that every 2 seconds a call through RPC will go
> from compute to conductor to write migration data to the database.
> In case of parallel live migrations each migration will report
> progress by itself.
>
> Isn't 2 seconds interval too short for updates if the information
> is exposed through the API and it requires RPC and DB call to
> actually save it in the DB?
>
> Our default configuration allows only for 1 concurrent live
> migration [2], but it might vary between different deployments and
> use cases as it is configurable. Someone might want to trigger 10
> (or even more) parallel live migrations and each might take even a
> day to finish in case of block migration. Also if deployment is big
> enough rabbitmq might be fully-loaded. I'm not sure whether updating
> each migration every 2 seconds makes sense in this case. On the
> other hand it might be hard to observe fast enough that migration
> is stuck if we increase this interval...

Do we have any actual data that this is a real problem? I have a pretty
hard time believing that a database update of a single field every 2
seconds is going to be what pushes Nova over the edge into a performance
collapse, even if there are 20 migrations running in parallel, when you
compare it to the amount of DB queries & updates done across other areas
of the code for pretty much every single API call and background job.

Regards,
Daniel
Re: [openstack-dev] [nova] Migration progress
> -----Original Message-----
> From: Daniel P. Berrange [mailto:berra...@redhat.com]
> Sent: 03 February 2016 10:49
> To: OpenStack Development Mailing List (not for usage questions)
> Cc: Feng, Shaohe
> Subject: Re: [openstack-dev] [nova] Migration progress
>
> On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> > On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> > > Hello everyone,
> > >
> > > On the yesterday's live migration meeting we had concerns that
> > > interval of writing migration progress to the database is too short.
> > >
> > > Information about migration progress will be stored in the database
> > > and exposed through the API (/servers//migrations/). In current
> > > proposition [1] migration progress will be updated every 2 seconds.
> > > It basically means that every 2 seconds a call through RPC will go
> > > from compute to conductor to write migration data to the database.
> > > In case of parallel live migrations each migration will report
> > > progress by itself.
> > >
> > > Isn't 2 seconds interval too short for updates if the information
> > > is exposed through the API and it requires RPC and DB call to
> > > actually save it in the DB?
> > >
> > > Our default configuration allows only for 1 concurrent live
> > > migration [2], but it might vary between different deployments and
> > > use cases as it is configurable. Someone might want to trigger 10
> > > (or even more) parallel live migrations and each might take even a
> > > day to finish in case of block migration. Also if deployment is big
> > > enough rabbitmq might be fully-loaded. I'm not sure whether updating
> > > each migration every 2 seconds makes sense in this case. On the
> > > other hand it might be hard to observe fast enough that migration
> > > is stuck if we increase this interval...
> >
> > Do we have any actual data that this is a real problem. I have a
> > pretty hard time believing that a database update of a single field
> > every 2 seconds is going to be what pushes Nova over the edge into a
> > performance collapse, even if there are 20 migrations running in
> > parallel, when you compare it to the amount of DB queries & updates
> > done across other areas of the code for pretty much every single API
> > call and background job.

As a data point: when we were doing live migrations in HP public cloud
for rolling updates, we were maintaining approximately 150 concurrent
migrations through the process. At 2s intervals that would make approx.
75 updates per second. We don't feel that would have been a problem. We
also spoke to Michael Still and he thought it wouldn't be a problem for
Rackspace (remembering they have cells). Having said that, I have no
idea of the numbers in their case and would rather they spoke for
themselves in this thread.

> Also note that progress is rounded to the nearest integer. So even if the
> migration runs all day, there is a maximum of 100 possible changes in value
> for the progress field, so most of the updates should turn in to no-ops at the
> database level.
>
> Regards,
> Daniel
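The arithmetic behind the data point above is simply concurrent migrations divided by the update interval. A small sketch, using only figures mentioned in the thread (150 or 20 concurrent migrations, 2 s or 5 s intervals); the function name is made up for illustration.

```python
def updates_per_second(concurrent_migrations: int, interval_seconds: float) -> float:
    """Aggregate update rate if every migration reports once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return concurrent_migrations / interval_seconds


# Figures quoted in the thread.
for migrations in (20, 150):
    for interval in (2, 5):
        rate = updates_per_second(migrations, interval)
        print(f"{migrations} migrations every {interval}s: {rate:.1f} updates/s")
```

Moving from a 2-second to a 5-second interval cuts the aggregate rate for 150 migrations from 75 to 30 updates per second.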
Re: [openstack-dev] [nova] Migration progress
On 03/02/16 10:49, Daniel P. Berrange wrote:
> On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> > On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> > > Hello everyone,
> > >
> > > On the yesterday's live migration meeting we had concerns that
> > > interval of writing migration progress to the database is too short.
> > >
> > > Information about migration progress will be stored in the database
> > > and exposed through the API (/servers//migrations/). In current
> > > proposition [1] migration progress will be updated every 2 seconds.
> > > It basically means that every 2 seconds a call through RPC will go
> > > from compute to conductor to write migration data to the database.
> > > In case of parallel live migrations each migration will report
> > > progress by itself.
> > >
> > > Isn't 2 seconds interval too short for updates if the information
> > > is exposed through the API and it requires RPC and DB call to
> > > actually save it in the DB?
> > >
> > > Our default configuration allows only for 1 concurrent live
> > > migration [2], but it might vary between different deployments and
> > > use cases as it is configurable. Someone might want to trigger 10
> > > (or even more) parallel live migrations and each might take even a
> > > day to finish in case of block migration. Also if deployment is big
> > > enough rabbitmq might be fully-loaded. I'm not sure whether updating
> > > each migration every 2 seconds makes sense in this case. On the
> > > other hand it might be hard to observe fast enough that migration
> > > is stuck if we increase this interval...
> >
> > Do we have any actual data that this is a real problem. I have a
> > pretty hard time believing that a database update of a single field
> > every 2 seconds is going to be what pushes Nova over the edge into a
> > performance collapse, even if there are 20 migrations running in
> > parallel, when you compare it to the amount of DB queries & updates
> > done across other areas of the code for pretty much every single API
> > call and background job.
>
> Also note that progress is rounded to the nearest integer. So even if
> the migration runs all day, there is a maximum of 100 possible
> changes in value for the progress field, so most of the updates
> should turn in to no-ops at the database level.
>
> Regards,
> Daniel

I agree with Daniel: these RPC and DB access ops are a tiny percentage
of the overall load on rabbit and mysql, and, properly configured, these
subsystems should have no issues with this workload.

One correction: unless I'm misreading it, the existing
_live_migration_monitor code updates the progress field of the instance
record every 5 seconds. However, this value can go up and down, so an
infinite number of updates is possible?

However, the issue raised here is not with the existing implementation
but with the proposed change
https://review.openstack.org/#/c/258813/5/nova/virt/libvirt/driver.py
This adds a save() operation on the migration object every 2 seconds.

Paul Carlton
Software Engineer
Cloud Services
Hewlett Packard Enterprise
BUK03:T242
Longdown Avenue
Stoke Gifford
Bristol BS34 8QZ

Mobile: +44 (0)7768 994283
Office: +44 (0)117 316 2189
Email: paul.carlt...@hpe.com
irc: paul-carlton2
Re: [openstack-dev] [nova] Migration progress
On Wed, Feb 03, 2016 at 11:27:16AM +, Paul Carlton wrote:
> On 03/02/16 10:49, Daniel P. Berrange wrote:
> > On Wed, Feb 03, 2016 at 10:44:36AM +, Daniel P. Berrange wrote:
> > > On Wed, Feb 03, 2016 at 10:37:24AM +, Koniszewski, Pawel wrote:
> > > > Hello everyone,
> > > >
> > > > On the yesterday's live migration meeting we had concerns that
> > > > interval of writing migration progress to the database is too short.
> > > >
> > > > Information about migration progress will be stored in the database
> > > > and exposed through the API (/servers//migrations/). In current
> > > > proposition [1] migration progress will be updated every 2 seconds.
> > > > It basically means that every 2 seconds a call through RPC will go
> > > > from compute to conductor to write migration data to the database.
> > > > In case of parallel live migrations each migration will report
> > > > progress by itself.
> > > >
> > > > Isn't 2 seconds interval too short for updates if the information
> > > > is exposed through the API and it requires RPC and DB call to
> > > > actually save it in the DB?
> > > >
> > > > Our default configuration allows only for 1 concurrent live
> > > > migration [2], but it might vary between different deployments and
> > > > use cases as it is configurable. Someone might want to trigger 10
> > > > (or even more) parallel live migrations and each might take even a
> > > > day to finish in case of block migration. Also if deployment is big
> > > > enough rabbitmq might be fully-loaded. I'm not sure whether updating
> > > > each migration every 2 seconds makes sense in this case. On the
> > > > other hand it might be hard to observe fast enough that migration
> > > > is stuck if we increase this interval...
> > >
> > > Do we have any actual data that this is a real problem. I have a
> > > pretty hard time believing that a database update of a single field
> > > every 2 seconds is going to be what pushes Nova over the edge into a
> > > performance collapse, even if there are 20 migrations running in
> > > parallel, when you compare it to the amount of DB queries & updates
> > > done across other areas of the code for pretty much every single API
> > > call and background job.
> >
> > Also note that progress is rounded to the nearest integer. So even if
> > the migration runs all day, there is a maximum of 100 possible
> > changes in value for the progress field, so most of the updates
> > should turn in to no-ops at the database level.
> >
> > Regards,
> > Daniel
>
> I agree with Daniel, these rpc and db access ops are a tiny percentage
> of the overall load on rabbit and mysql and properly configured these
> subsystems should have no issues with this workload.
>
> One correction, unless I'm misreading it, the existing
> _live_migration_monitor code updates the progress field of the instance
> record every 5 seconds. However this value can go up and down so
> an infinite number of updates are possible?

Oh yes, you are in fact correct. Technically you could have an unbounded
number of updates if migration goes backwards. One mitigation is that if
we see progress going backwards, we'll actually abort the migration if
it gets stuck for too long. We'll also be progressively increasing the
permitted downtime. So except in pathological scenarios I think the
number of updates should still be relatively small.

> However, the issue raised here is not with the existing implementation
> but with the proposed change
> https://review.openstack.org/#/c/258813/5/nova/virt/libvirt/driver.py
> This add a save() operation on the migration object every 2 seconds

OK, that is more heavyweight, since it is recording the raw byte values
and so it is guaranteed to do a database update pretty much every time.
It still shouldn't be too unreasonable a load though. FWIW I think it
is worth being consistent in the update frequency between the progress
value & the migration object save, so switching to every 5 seconds
probably makes more sense, so we know both objects are reflecting the
same point in time.

Regards,
Daniel
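One way to read the consistency suggestion above is that a single monitor tick should take one sample and apply it to both records. The sketch below illustrates that idea only; `InstanceRecord`, `MigrationRecord`, `monitor_tick` and the field names are hypothetical and do not correspond to nova's actual objects.

```python
from dataclasses import dataclass


@dataclass
class InstanceRecord:
    """Hypothetical instance record holding the rounded percentage."""
    progress: int = 0
    updated_at: float = 0.0


@dataclass
class MigrationRecord:
    """Hypothetical migration record holding the raw byte counters."""
    memory_total: int = 0
    memory_remaining: int = 0
    updated_at: float = 0.0


def monitor_tick(now, memory_total, memory_remaining, instance, migration):
    """Apply one sample to both records so they describe the same instant."""
    migration.memory_total = memory_total
    migration.memory_remaining = memory_remaining
    migration.updated_at = now
    instance.progress = round(100 * (memory_total - memory_remaining) / memory_total)
    instance.updated_at = now


UPDATE_INTERVAL = 5  # seconds; the single shared interval suggested in the thread

instance, migration = InstanceRecord(), MigrationRecord()
# Bytes remaining per tick; the third sample goes backwards, as discussed above.
remaining_samples = [4096, 3072, 3584, 1024]
for tick, remaining in enumerate(remaining_samples):
    monitor_tick(tick * UPDATE_INTERVAL, 4096, remaining, instance, migration)
print(instance, migration)
```

Because both records are written from the same sample, a reader comparing the percentage with the raw counters never sees values taken at different times.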
Re: [openstack-dev] [nova] Migration progress
Hi Paul,

Comments inline:

2015-11-23 16:36 GMT+08:00 Paul Carlton:
> John
>
> At the live migration sub team meeting I undertook to look at the issue of progress reporting.
>
> The use cases I'm envisaging are...
>
> As a user I want to know how much longer my instance will be migrating for.
>
> As an operator I want to identify any migration that are making slow progress so I can expedite their progress or abort them.
>
> The current implementation reports on the instance's migration with respect to memory transfer, using the total memory and memory remaining fields from libvirt to report the percentage of memory still to be transferred. Due to the instance writing to pages already transferred this percentage can go up as well as down. Daniel has done a good job of generating regular log records to report progress and highlight lack of progress but from the API all a user/operator can see is the current percentage complete. By observing this periodically they can identify instance migrations that are struggling to migrate memory pages fast enough to keep pace with the instance's memory updates.
>
> The problem is that at present we have only one field, the instance progress, to record progress. With a live migration there are measures

[Shaohe]: The OpenStack API reference (http://developer.openstack.org/api-ref-compute-v2.1.html#listDetailServers) describes the instance progress as "A percentage value of the build progress." For the libvirt driver, however, it is in fact the migration progress; for other drivers it is the build progress. There is a spec that proposes some changes here: https://review.openstack.org/#/c/249086/

> of progress, how much of the ephemeral disks (not needed for shared disk setups) have been copied and how much of the memory has been copied. Both can go up and down as the instance writes to pages already copied causing those pages to need to be copied again. As Daniel says in his comments in the code, the disk size could dwarf the memory so reporting both in single percentage number is problematic.
>
> We could add an additional progress item to the instance object, i.e. disk progress and memory progress but that seems odd to have an additional progress field only for this operation so this is probably a non starter!
>
> For operations staff with access to log files we could report disk progress as well as memory in the log file, however that does not address the needs of users and whilst log files are the right place for support staff to look when investigating issues operational tooling is much better served by notification messages.
>
> Thus I'd recommend generating periodic notifications during a migration to report both memory and disk progress would be useful? Cloud operators are likely to manage their instance migration activity using some orchestration tooling which could consume these notifications and deduce what challenges the instance migration is encountering and thus determine how to address any issues.
>
> The use cases are only partially addressed by the current implementation, they can repeatedly get the server details and look at the progress percentage to see how quickly (or even if) it is increasing and determine how long the instance is likely to be migrating for. However for an instance that has a large disk and/or is doing a high rate of disk i/o they may see the percentage complete (i.e. memory) repeatedly showing 90%+ but the instance migration does not complete.
>
> The nova spec https://review.openstack.org/#/c/248472/ suggests making detailed information available via the os-migrations object. This is not a bad idea but I have some issues with the implementation that I will share on that spec.

[Shaohe]: Daniel has given some comments on this spec, and we have updated it. Maybe we can work together to improve it further.

I have worked on multi-thread compression for migration in libvirt and have looked into some live migration performance optimizations. That led to the following idea:

1. Let nova expose more live migration details, such as the RAM statistics, the xbzrle-cache status and, in future, the multi-thread compression information.
2. Let nova enable auto-converge and tune the xbzrle cache and multi-thread compression dynamically.
3. Other projects can then build a good strategy for tuning the live migration based on those migration details.

For example, cache size is key to xbzrle performance. Ideally the cache is the same size as the guest's total RAM, but that much memory may not always be available on the host. A higher multi-thread compression level is better, but it consumes more CPU. Auto-converge slows down the guest's CPUs. So things are not always as good as I had expected.

We also submitted a topic to the summit about this idea, but it was not accepted. Topic: Link:
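To make the trade-offs in the example concrete, here is a rough sketch of the kind of strategy an external project could apply if nova exposed these details. Everything in it is an assumption for illustration: the `Tuning` record, the `choose_tuning` function, the 0..9 compression scale, the "half of free host memory" headroom and the rule tying the level to idle CPUs are all made up, and none of it calls a real nova or libvirt API.

```python
from dataclasses import dataclass

MAX_COMPRESS_LEVEL = 9  # assumed upper bound of the compression level scale


@dataclass(frozen=True)
class Tuning:
    """Hypothetical set of knobs a strategy might hand back to nova."""
    xbzrle_cache_bytes: int
    compress_level: int
    auto_converge: bool


def choose_tuning(guest_ram_bytes: int, host_free_bytes: int,
                  idle_host_cpus: int, stalled: bool) -> Tuning:
    # The ideal cache matches the guest's RAM, but the host may not have it
    # to spare, so cap it at half of the free host memory (assumed headroom).
    cache = min(guest_ram_bytes, host_free_bytes // 2)
    # A higher level compresses better but costs host CPU, so scale it with
    # the number of idle host CPUs (an arbitrary rule for this sketch).
    level = max(1, min(MAX_COMPRESS_LEVEL, idle_host_cpus))
    # Auto-converge slows the guest down, so only ask for it once the
    # migration has stopped making progress.
    return Tuning(cache, level, auto_converge=stalled)
```

For an 8 GiB guest on a host with 4 GiB free and 4 idle CPUs, this picks a 2 GiB cache, level 4 and no auto-converge, which shows how quickly the "cache as large as guest RAM" ideal gets compromised.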
Re: [openstack-dev] [nova] Migration progress
On 23 November 2015 at 08:36, Paul Carlton wrote:
> John
>
> At the live migration sub team meeting I undertook to look at the issue of progress reporting.
>
> The use cases I'm envisaging are...
>
> As a user I want to know how much longer my instance will be migrating for.
>
> As an operator I want to identify any migration that are making slow progress so I can expedite their progress or abort them.

+1 Agreed with this need. Proposals to add pause and cancel clearly make this need more acute.

> The current implementation reports on the instance's migration with respect to memory transfer, using the total memory and memory remaining fields from libvirt to report the percentage of memory still to be transferred. Due to the instance writing to pages already transferred this percentage can go up as well as down. Daniel has done a good job of generating regular log records to report progress and highlight lack of progress but from the API all a user/operator can see is the current percentage complete. By observing this periodically they can identify instance migrations that are struggling to migrate memory pages fast enough to keep pace with the instance's memory updates.
>
> The problem is that at present we have only one field, the instance progress, to record progress. With a live migration there are measures of progress, how much of the ephemeral disks (not needed for shared disk setups) have been copied and how much of the memory has been copied. Both can go up and down as the instance writes to pages already copied causing those pages to need to be copied again. As Daniel says in his comments in the code, the disk size could dwarf the memory so reporting both in single percentage number is problematic.
>
> We could add an additional progress item to the instance object, i.e. disk progress and memory progress but that seems odd to have an additional progress field only for this operation so this is probably a non starter!
>
> For operations staff with access to log files we could report disk progress as well as memory in the log file, however that does not address the needs of users and whilst log files are the right place for support staff to look when investigating issues operational tooling is much better served by notification messages.
>
> Thus I'd recommend generating periodic notifications during a migration to report both memory and disk progress would be useful? Cloud operators are likely to manage their instance migration activity using some orchestration tooling which could consume these notifications and deduce what challenges the instance migration is encountering and thus determine how to address any issues.

To be clear, our notifications are not designed to be consumed by end users.

> The use cases are only partially addressed by the current implementation, they can repeatedly get the server details and look at the progress percentage to see how quickly (or even if) it is increasing and determine how long the instance is likely to be migrating for. However for an instance that has a large disk and/or is doing a high rate of disk i/o they may see the percentage complete (i.e. memory) repeatedly showing 90%+ but the instance migration does not complete.

Agreed, reporting progress, particularly with live-migrate, is awful right now. Long term, I have my eye on this work:
https://etherpad.openstack.org/p/liberty-cross-project-user-notifications

But we should work on getting a good conceptual model for the progress that can be exposed using the above system.

> The nova spec https://review.openstack.org/#/c/248472/ suggests making detailed information available via the os-migrations object. This is not a bad idea but I have some issues with the implementation that I will share on that spec.

We do also need something that works across all hypervisor types. Let's talk more on that spec review.

Thanks,
johnthetubaguy
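As a starting point for the "conceptual model" question, here is one possible shape for a progress record that does not depend on the hypervisor. It is only a sketch: `TransferProgress`, `MigrationProgress`, `ProgressSource` and the two fake drivers are hypothetical names invented here, not existing nova classes. A driver that cannot report a counter, or a shared-storage migration with no disk copy, simply leaves that part as `None`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TransferProgress:
    total_bytes: int
    remaining_bytes: int


@dataclass(frozen=True)
class MigrationProgress:
    memory: Optional[TransferProgress]  # None if the driver cannot report it
    disk: Optional[TransferProgress]    # None e.g. when storage is shared


class ProgressSource(ABC):
    """Hypothetical per-driver hook; each virt driver would fill in what it knows."""

    @abstractmethod
    def get_progress(self) -> MigrationProgress:
        """Return the current byte counters for one in-flight migration."""


class FakeBlockMigrationDriver(ProgressSource):
    def get_progress(self) -> MigrationProgress:
        return MigrationProgress(TransferProgress(4096, 1024),
                                 TransferProgress(10000, 7500))


class FakeSharedStorageDriver(ProgressSource):
    def get_progress(self) -> MigrationProgress:
        return MigrationProgress(TransferProgress(4096, 1024), disk=None)


def describe(source: ProgressSource) -> str:
    """Render the neutral record without knowing which driver produced it."""
    progress = source.get_progress()
    parts = []
    for name, item in (("memory", progress.memory), ("disk", progress.disk)):
        if item is None:
            parts.append(f"{name}: n/a")
        else:
            parts.append(f"{name}: {item.remaining_bytes}/{item.total_bytes} bytes remaining")
    return "; ".join(parts)
```

Keeping only totals and remaining byte counts in the record leaves the question of how to present them (one percentage, two, or an estimate) to whatever layer consumes it.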
Re: [openstack-dev] [nova] Migration progress
On 23/11/15 11:02, John Garbutt wrote:
> On 23 November 2015 at 08:36, Paul Carlton wrote:
>> John
>>
>> At the live migration sub team meeting I undertook to look at the issue of progress reporting.
>>
>> The use cases I'm envisaging are...
>>
>> As a user I want to know how much longer my instance will be migrating for.
>>
>> As an operator I want to identify any migration that are making slow progress so I can expedite their progress or abort them.
>
> +1 Agreed with this need. Proposals to add pause and cancel clearly make this need more acute.
>
>> The current implementation reports on the instance's migration with respect to memory transfer, using the total memory and memory remaining fields from libvirt to report the percentage of memory still to be transferred. Due to the instance writing to pages already transferred this percentage can go up as well as down. Daniel has done a good job of generating regular log records to report progress and highlight lack of progress but from the API all a user/operator can see is the current percentage complete. By observing this periodically they can identify instance migrations that are struggling to migrate memory pages fast enough to keep pace with the instance's memory updates.
>>
>> The problem is that at present we have only one field, the instance progress, to record progress. With a live migration there are measures of progress, how much of the ephemeral disks (not needed for shared disk setups) have been copied and how much of the memory has been copied. Both can go up and down as the instance writes to pages already copied causing those pages to need to be copied again. As Daniel says in his comments in the code, the disk size could dwarf the memory so reporting both in single percentage number is problematic.
>>
>> We could add an additional progress item to the instance object, i.e. disk progress and memory progress but that seems odd to have an additional progress field only for this operation so this is probably a non starter!
>>
>> For operations staff with access to log files we could report disk progress as well as memory in the log file, however that does not address the needs of users and whilst log files are the right place for support staff to look when investigating issues operational tooling is much better served by notification messages.
>>
>> Thus I'd recommend generating periodic notifications during a migration to report both memory and disk progress would be useful? Cloud operators are likely to manage their instance migration activity using some orchestration tooling which could consume these notifications and deduce what challenges the instance migration is encountering and thus determine how to address any issues.
>
> To be clear, our notifications are not designed to be consumed by end users.

Yep, I see this as something cloud operations tooling could consume. It does not address end users' needs.

>> The use cases are only partially addressed by the current implementation, they can repeatedly get the server details and look at the progress percentage to see how quickly (or even if) it is increasing and determine how long the instance is likely to be migrating for. However for an instance that has a large disk and/or is doing a high rate of disk i/o they may see the percentage complete (i.e. memory) repeatedly showing 90%+ but the instance migration does not complete.
>
> Agreed reporting progress, particularly with live-migrate, is awful right now. Long term, I have my eye on this work:
> https://etherpad.openstack.org/p/liberty-cross-project-user-notifications
>
> But we should work on getting a good conceptual model for the progress that can be exposed using the above system.
>
>> The nova spec https://review.openstack.org/#/c/248472/ suggests making detailed information available via the os-migrations object. This is not a bad idea but I have some issues with the implementation that I will share on that spec.
>
> We do also need something that works across all hypervisor types. Lets talk more on that spec review.
>
> Thanks,
> johnthetubaguy

--
Paul Carlton
Software Engineer
Cloud Services
Hewlett Packard
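As a rough illustration of what operations tooling might do with periodic progress notifications, the sketch below flags migrations whose remaining data has not shrunk over the last few samples. The payload keys (`migration_id`, `memory_remaining`, `disk_remaining`) and the `find_stalled` helper are assumptions made for this example; the thread does not define a notification format.

```python
from typing import Dict, Iterable, List


def find_stalled(notifications: Iterable[Dict], window: int = 3) -> List[str]:
    """Return ids of migrations that made no net progress over `window` samples.

    Each notification is assumed to be a dict with hypothetical keys
    "migration_id", "memory_remaining" and "disk_remaining" (bytes).
    """
    history: Dict[str, List[int]] = {}
    for note in notifications:
        remaining = note["memory_remaining"] + note["disk_remaining"]
        history.setdefault(note["migration_id"], []).append(remaining)

    stalled = []
    for migration_id, samples in history.items():
        recent = samples[-window:]
        # Remaining bytes can go up as well as down, so compare the ends of
        # the window rather than requiring every sample to decrease.
        if len(recent) == window and recent[-1] >= recent[0]:
            stalled.append(migration_id)
    return stalled
```

Comparing only the first and last sample in the window tolerates the up-and-down movement described earlier in the thread; a stricter or smoother rule (for example a moving average) would be an equally reasonable choice.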
Re: [openstack-dev] [nova] Migration progress
On Mon, Nov 23, 2015 at 08:36:32AM +, Paul Carlton wrote:
> John
>
> At the live migration sub team meeting I undertook to look at the issue of progress reporting.
>
> The use cases I'm envisaging are...
>
> As a user I want to know how much longer my instance will be migrating for.
>
> As an operator I want to identify any migration that are making slow progress so I can expedite their progress or abort them.
>
> The current implementation reports on the instance's migration with respect to memory transfer, using the total memory and memory remaining fields from libvirt to report the percentage of memory still to be transferred. Due to the instance writing to pages already transferred this percentage can go up as well as down. Daniel has done a good job of generating regular log records to report progress and highlight lack of progress but from the API all a user/operator can see is the current percentage complete. By observing this periodically they can identify instance migrations that are struggling to migrate memory pages fast enough to keep pace with the instance's memory updates.
>
> The problem is that at present we have only one field, the instance progress, to record progress. With a live migration there are measures of progress, how much of the ephemeral disks (not needed for shared disk setups) have been copied and how much of the memory has been copied. Both can go up and down as the instance writes to pages already copied causing those pages to need to be copied again. As Daniel says in his comments in the code, the disk size could dwarf the memory so reporting both in single percentage number is problematic.
>
> We could add an additional progress item to the instance object, i.e. disk progress and memory progress but that seems odd to have an additional progress field only for this operation so this is probably a non starter!
>
> For operations staff with access to log files we could report disk progress as well as memory in the log file, however that does not address the needs of users and whilst log files are the right place for support staff to look when investigating issues operational tooling is much better served by notification messages.
>
> Thus I'd recommend generating periodic notifications during a migration to report both memory and disk progress would be useful? Cloud operators are likely to manage their instance migration activity using some orchestration tooling which could consume these notifications and deduce what challenges the instance migration is encountering and thus determine how to address any issues.
>
> The use cases are only partially addressed by the current implementation, they can repeatedly get the server details and look at the progress percentage to see how quickly (or even if) it is increasing and determine how long the instance is likely to be migrating for. However for an instance that has a large disk and/or is doing a high rate of disk i/o they may see the percentage complete (i.e. memory) repeatedly showing 90%+ but the instance migration does not complete.
>
> The nova spec https://review.openstack.org/#/c/248472/ suggests making detailed information available via the os-migrations object. This is not a bad idea but I have some issues with the implementation that I will share on that spec.

As I mentioned in the spec, I won't support exposing anything other than disk total + remaining via the API. All the other stats are low level QEMU specific implementation details that I feel the public API users have no business knowing about.

In general I think we need to be wary of exposing lots of info + knobs via the API, as that direction essentially ends up forcing the problem onto the client application. The focus should really be on ensuring that Nova consumes all these stats exposed by QEMU and makes decisions itself based on that. At most an external application should have information on the data transfer progress.

I'm not even convinced that applications should need to be able to figure out if a live migration is stuck. I generally think that any scenario in which a live migration can get stuck is a bug in Nova's management of the migration process. IOW, the focus of our efforts should be on ensuring Nova does the right thing to guarantee that live migration will never get stuck. At which point a Nova client user / application should really only care about the overall progress of a live migration.

Regards,
Daniel
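If only total and remaining byte counts are exposed, one way a client could derive a single "overall progress" figure is to weight by bytes rather than average two percentages. The sketch below is an illustration under that assumption; `overall_progress` and its parameter names are hypothetical and not part of any nova API.

```python
def overall_progress(memory_total: int, memory_remaining: int,
                     disk_total: int, disk_remaining: int) -> int:
    """Byte-weighted percentage of data transferred so far (0..100)."""
    total = memory_total + disk_total
    if total <= 0:
        return 0  # nothing known to transfer yet
    done = total - (memory_remaining + disk_remaining)
    # Clamp, since remaining counters can move up as pages are re-dirtied.
    return max(0, min(100, done * 100 // total))


# Example: 4000 units of RAM with 400 left (90% done) and a 96000 unit disk
# with 48000 left (50% done). A plain average of the two percentages would
# say 70%, while weighting by bytes gives 51%.
example = overall_progress(4000, 400, 96000, 48000)
```

The example also shows the concern raised earlier in the thread: once the disk dwarfs the memory, the combined figure is dominated by disk copy and says very little about memory convergence.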