Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder - discussion reminder
Bruce,

Thanks, it helps clarify.

thanx,
deepak

On Thu, Mar 20, 2014 at 10:07 PM, Bruce Montague bruce_monta...@symantec.com wrote:

Hi, Deepak. With the caveat that both the etherpad and Ron's presentation are pretty high-level, my guess is:

1) DR middleware refers to the orchestration engine managing the entire DR process between the primary and secondary sites. (Something like two Heat workflows interacting, or a workflow that works across multiple OpenStack deployments.) The replication agent is what does something resembling continual cloning of a volume from the primary to the secondary, with snapshots appearing on the secondary at times when the volume's contents are application-consistent and consistent with each other (for all the volumes of a VM or a multi-tier app). These secondary-site snapshots appear at specified rates (so you know how recent your oldest snapshots there will be). For instance, the replication agent might take some sort of snapshot(s) on the primary and then update the corresponding volume(s) on the secondary using the primary snapshot(s). This resembles (maybe it could even be) something like DRBD or NBD. Many SAN vendors provide some form of replication agent between SANs.

2) Regarding metadata, the replication agent might only be replicating the volumes of some tenant VMs. It might not be replicating any volumes containing OpenStack metadata. (This is for the smaller tenant use case, not complete OpenStack deployment mirroring or some such. If complete mirroring were done, maybe you wouldn't have to sync metadata, if you designed the system just for that.) DR is often something that a tenant might apply only to a set of core servers (key pets). In this use case the two (or more) DR sites might not be symmetrical. The secondary site needs to know it is in the secondary role. Things like IP addresses, and maybe security and firewall rules, might have to change for the workload to run at the secondary site.
Applying this metadata to VMs on the secondary site (what needs to change in the personality), when they boot, is probably something Heat can do.

-bruce

From: Deepak Shetty [mailto:dpkshe...@gmail.com]
Sent: Wednesday, March 19, 2014 11:54 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder - discussion reminder

Hi List,

I was looking at the etherpad and the March 19 notes and have a few Qs:

1) How is the DR middleware (depicted in Ron's youtube video) different from the replication agent (noted in the March 19 etherpad notes)? Are they the same? If not, how/why are they different?

2) Maybe a dumb Q, but still: why do we need to worry about syncing metadata differently? If all the storage that is used across OpenStack services (and in the typical case it might be just one backend, say GlusterFS) is being replicated during the DR, wouldn't the metadata be replicated too? Why do we need to be concerned about it as a separate entity?

thanx,
deepak
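Bruce's description of the replication agent in this message (snapshot on the primary, update the secondary from that snapshot, then snapshot the secondary) can be sketched roughly as below. This is only a toy model under stated assumptions: `Volume`, `changed_blocks`, and `replicate_cycle` are hypothetical names invented for illustration, volumes are in-memory dictionaries, and nothing here corresponds to an actual Cinder, DRBD, or NBD interface.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Volume:
    """Toy block volume (hypothetical): maps block index -> block contents."""
    blocks: dict[int, bytes] = field(default_factory=dict)
    snapshots: list[dict[int, bytes]] = field(default_factory=list)

    def snapshot(self) -> dict[int, bytes]:
        """Record and return a point-in-time copy of the volume."""
        snap = dict(self.blocks)
        self.snapshots.append(snap)
        return snap


def changed_blocks(old: dict[int, bytes], new: dict[int, bytes]) -> dict[int, bytes | None]:
    """Blocks that differ between two snapshots; None marks a removed block."""
    delta: dict[int, bytes | None] = {
        index: data for index, data in new.items() if old.get(index) != data
    }
    delta.update({index: None for index in old if index not in new})
    return delta


def replicate_cycle(primary: Volume, secondary: Volume,
                    last_sent: dict[int, bytes]) -> dict[int, bytes]:
    """One replication cycle: snapshot the primary, ship only the changes,
    then snapshot the secondary so a consistent copy exists there."""
    snap = primary.snapshot()
    for index, data in changed_blocks(last_sent, snap).items():
        if data is None:
            secondary.blocks.pop(index, None)
        else:
            secondary.blocks[index] = data
    secondary.snapshot()
    return snap  # becomes last_sent for the next cycle
```

In this sketch the caller is assumed to have quiesced the guest before calling `replicate_cycle`, so that the primary snapshot is application-consistent; the rate at which the cycle runs determines how stale the newest secondary snapshot can be.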
Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder - discussion reminder
Hi List,

I was looking at the etherpad and the March 19 notes and have a few Qs:

1) How is the DR middleware (depicted in Ron's youtube video) different from the replication agent (noted in the March 19 etherpad notes)? Are they the same? If not, how/why are they different?

2) Maybe a dumb Q, but still: why do we need to worry about syncing metadata differently? If all the storage that is used across OpenStack services (and in the typical case it might be just one backend, say GlusterFS) is being replicated during the DR, wouldn't the metadata be replicated too? Why do we need to be concerned about it as a separate entity?

thanx,
deepak

On Wed, Mar 19, 2014 at 2:11 PM, Ronen Kat ronen...@il.ibm.com wrote:

For those who are interested, we will discuss the disaster recovery use cases and how to proceed toward the Juno summit on March 19 at 17:00 UTC (invitation below).

Call-in: https://www.teleconference.att.com/servlet/glbAccess?process=1accessCode=6406941accessNumber=1809417783#C2
Passcode: 6406941
Etherpad: https://etherpad.openstack.org/p/juno-disaster-recovery-call-for-stakeholders
Wiki: https://wiki.openstack.org/wiki/DisasterRecovery

Regards,
__
Ronen I. Kat, PhD
Storage Research
IBM Research - Haifa
Phone: +972.3.7689493
Email: ronen...@il.ibm.com
[openstack-dev] Disaster Recovery for OpenStack - call for stakeholder - discussion reminder
For those who are interested, we will discuss the disaster recovery use cases and how to proceed toward the Juno summit on March 19 at 17:00 UTC (invitation below).

Call-in: https://www.teleconference.att.com/servlet/glbAccess?process=1accessCode=6406941accessNumber=1809417783#C2
Passcode: 6406941
Etherpad: https://etherpad.openstack.org/p/juno-disaster-recovery-call-for-stakeholders
Wiki: https://wiki.openstack.org/wiki/DisasterRecovery

Regards,
__
Ronen I. Kat, PhD
Storage Research
IBM Research - Haifa
Phone: +972.3.7689493
Email: ronen...@il.ibm.com

From: Luohao (brian) brian.luo...@huawei.com
To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Date: 14/03/2014 03:59 AM
Subject: Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder

1. fsfreeze with VSS has been added to qemu upstream; see http://lists.gnu.org/archive/html/qemu-devel/2013-02/msg01963.html for usage.
2. libvirt allows a client to send any command to qemu-ga; see http://wiki.libvirt.org/page/Qemu_guest_agent
3. Linux fsfreeze is not equivalent to Windows fsfreeze+VSS. Linux fsfreeze offers filesystem consistency only, while Windows VSS allows agents like SQL Server to register plugins that flush their caches to disk when a snapshot occurs.
4. My understanding is that XenServer does not support fsfreeze+VSS now, because XenServer normally does not use the block backend in qemu.
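As a rough illustration of points 1 and 2 in Luohao's list, the sketch below builds the JSON messages for the guest agent's `guest-fsfreeze-freeze` and `guest-fsfreeze-thaw` commands and wraps them so that a thaw is always attempted after a successful freeze. The `send` callable is a hypothetical transport supplied by the caller (for example, something wrapping `virsh qemu-agent-command <domain> '<json>'`); the helper names `qga_command` and `frozen_filesystems` are assumptions made for this example and are not part of libvirt or qemu.

```python
import json
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


def qga_command(name: str) -> str:
    """Serialize a guest-agent command with no arguments."""
    return json.dumps({"execute": name})


@contextmanager
def frozen_filesystems(send: Callable[[str], str]) -> Iterator[Optional[int]]:
    """Freeze guest filesystems for the duration of the with-block.

    `send` is a caller-supplied (hypothetical) function that delivers a JSON
    command string to qemu-ga and returns the JSON reply as a string.
    """
    reply = json.loads(send(qga_command("guest-fsfreeze-freeze")))
    try:
        # The agent reports how many filesystems were frozen.
        yield reply.get("return")
    finally:
        # Always thaw, even if taking the snapshot raised an exception.
        send(qga_command("guest-fsfreeze-thaw"))
```

A caller would take the volume snapshot inside the `with frozen_filesystems(send):` block. As point 3 notes, on Linux this gives filesystem consistency only; application-level flushing needs something extra inside the guest.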
Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder
Regarding (1) [Single VM], the following use cases could be added as a supplement:

1. Protection Group: Define the set of instances to be protected.
2. Protection Policy: Define the policy for the protection group, such as sync period, sync priority, advanced features, etc.
3. Recovery Plan: Define the recovery steps during recovery, such as the power-off and boot order of instances, etc.

--
zhangleiqiang (Ray)
Best Regards

-----Original Message-----
From: Bruce Montague [mailto:bruce_monta...@symantec.com]
Sent: Thursday, March 13, 2014 2:38 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder

Hi, regarding the call to create a list of disaster recovery (DR) use cases ( http://lists.openstack.org/pipermail/openstack-dev/2014-March/028859.html ), the following list sketches some speculative OpenStack DR use cases. These use cases do not reflect any specific product behavior and span a wide spectrum. This list is not a proposal; it is intended primarily to solicit additional discussion. The first basic use case, (1), is described in a bit more detail than the others; many of the others are elaborations on this basic theme.

* (1) [Single VM]

A single Windows VM with 4 volumes and VSS (Microsoft's Volume Shadowcopy Services) installed runs a key application and integral database. VSS can quiesce the app, database, filesystem, and I/O on demand and can be invoked external to the guest.

a. The VM's volumes, including the boot volume, are replicated to a remote DR site (another OpenStack deployment).

b. Some form of replicated VM or VM metadata exists at the remote site. This VM/description includes the replicated volumes. Some systems might use cold migration or some form of wide-area live VM migration to establish this remote site VM/description.

c. When specified by an SLA or policy, VSS is invoked, putting the VM's volumes in an application-consistent state. This state is flushed all the way through to the remote volumes. As each remote volume reaches its application-consistent state, this is recognized in some fashion, perhaps by an in-band signal, and a snapshot of the volume is made at the remote site. Volume replication is re-enabled immediately following the snapshot. A backup is then made of the snapshot on the remote site. At the completion of this cycle, application-consistent volume snapshots and backups exist on the remote site.

d. When a disaster or firedrill happens, the replication network connection is cut. The remote site VM, pre-created or defined so as to use the replicated volumes, is then booted, using the latest application-consistent state of the replicated volumes. The entire VM environment (management accounts, networking, external firewalling, console access, etc.), similar to that of the primary, either needs to pre-exist in some fashion on the secondary or be created dynamically by the DR system. The booting VM either needs to attach to a virtual network environment similar to that at the primary site, or the VM needs to have boot code that can alter its network personality. Networking configuration may occur in conjunction with an update to DNS and other networking infrastructure. It is necessary for all required networking configuration to be pre-specified or done automatically. No manual admin activity should be required. Environment requirements may be stored in a DR configuration or database associated with the replication.

e. In a firedrill or test, the virtual network environment at the remote site may be a test bubble isolated from the real network, with some provision for protected access (such as NAT). Automatic testing is necessary to verify that replication succeeded. These tests need to be configurable by the end-user and admin and integrated with DR orchestration.

f. After the VM has booted and been operational, the network connection between the two sites is re-established. A replication connection between the replicated volumes is re-established, and the replicated volumes are re-synced, with the roles of primary and secondary reversed. (Ongoing replication in this configuration may occur, driven from the new primary.)

g. A planned failback of the VM to the old primary proceeds similarly to the failover from the old primary to the old replica, but with roles reversed and the process minimizing offline time and data loss.

* (2) [Core tenant/project infrastructure VMs]

Twenty VMs power the core infrastructure of a group using a private cloud (OpenStack in their own datacenter). Not all VMs run Windows with VSS; some run Linux with some equivalent mechanism, such as qemu-ga, driving fsfreeze and signal scripts. These VMs are replicated to a remote OpenStack deployment, in a fashion similar to (1). Orchestration occurring at the remote site on failover is more
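The three supplementary items at the top of this message (protection group, protection policy, recovery plan) could be modelled along the following lines. This is only a sketch: the class and field names are hypothetical, they do not correspond to any existing OpenStack API, and treating the power-off order as the reverse of the boot order is an assumption made purely for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProtectionPolicy:
    """Hypothetical policy: how often and how urgently a group is synced."""
    sync_period_seconds: int
    sync_priority: int = 0


@dataclass
class ProtectionGroup:
    """Hypothetical group: the set of instances protected together."""
    name: str
    policy: ProtectionPolicy
    instance_ids: list[str] = field(default_factory=list)


@dataclass
class RecoveryPlan:
    """Hypothetical plan: the order in which instances boot on recovery."""
    group: ProtectionGroup
    boot_order: list[str]

    def validate(self) -> None:
        """Every protected instance must appear in the boot order, and only those."""
        missing = set(self.group.instance_ids) - set(self.boot_order)
        unknown = set(self.boot_order) - set(self.group.instance_ids)
        if missing or unknown:
            raise ValueError(
                f"boot order mismatch: missing={sorted(missing)}, unknown={sorted(unknown)}"
            )

    def power_off_order(self) -> list[str]:
        """Assumption for this sketch: shut down in reverse boot order."""
        return list(reversed(self.boot_order))
```

For example, a group containing `db`, `app`, and `web` with boot order `["db", "app", "web"]` would, under the assumption above, be powered off as `web`, `app`, `db`.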
Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder
Those use cases are very important for enterprise scenarios, but there's an important missing piece in the current OpenStack APIs: support for application-consistent backups via Volume Shadow Copy (or other solutions) at the instance level, including differential / incremental backups.

VSS can be seamlessly added to the Nova Hyper-V driver (it's included with the free Hyper-V Server), with, e.g., vSphere and XenServer supporting it as well (quiescing), and with the option for third-party vendors to add drivers for their solutions.

A generic Nova backup / restore API supporting those features is quite straightforward to design. The main question at this stage is whether the OpenStack community wants to support those use cases or not. Cinder backup/restore support [1] and volume replication [2] are surely a great starting point in this direction.

Alessandro

[1] https://review.openstack.org/#/c/69351/
[2] https://review.openstack.org/#/c/64026/
Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder
Hi, about OpenStack and VSS. Does anyone have experience with the qemu project's implementation of VSS support? They appear to have a within-guest agent, qemu-ga, that perhaps can work as a VSS requestor. Does it also work with KVM? Does qemu-ga work with libvirt (can VSS quiesce be triggered via libvirt)? I think there was an effort for qemu-ga to use fsfreeze as an equivalent to VSS on Linux systems; was that done? If so, could an OpenStack API provide a generic quiesce request that would then get passed to libvirt? (Also, the XenServer VSS support seems different from qemu/KVM's; is this true? Can it also be accessed through libvirt?)

Thanks,

-bruce

-----Original Message-----
From: Alessandro Pilotti [mailto:apilo...@cloudbasesolutions.com]
Sent: Thursday, March 13, 2014 6:49 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder

Those use cases are very important for enterprise scenarios, but there's an important missing piece in the current OpenStack APIs: support for application-consistent backups via Volume Shadow Copy (or other solutions) at the instance level, including differential / incremental backups.

VSS can be seamlessly added to the Nova Hyper-V driver (it's included with the free Hyper-V Server), with, e.g., vSphere and XenServer supporting it as well (quiescing), and with the option for third-party vendors to add drivers for their solutions.

A generic Nova backup / restore API supporting those features is quite straightforward to design. The main question at this stage is whether the OpenStack community wants to support those use cases or not. Cinder backup/restore support [1] and volume replication [2] are surely a great starting point in this direction.

Alessandro

[1] https://review.openstack.org/#/c/69351/
[2] https://review.openstack.org/#/c/64026/
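Bruce's question in this message about a generic quiesce request that is passed down to the hypervisor layer might look, very roughly, like the driver-dispatch sketch below. Every name here (`QuiesceDriver`, `RecordingDriver`, `QuiesceAPI`) is hypothetical and chosen only for illustration; no such interface is claimed to exist in Nova, and the stand-in driver merely records calls instead of talking to VSS or qemu-ga.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class QuiesceDriver(ABC):
    """Hypothetical per-hypervisor driver interface."""

    @abstractmethod
    def quiesce(self, instance_id: str) -> None:
        """Put the instance's volumes into an application-consistent state."""

    @abstractmethod
    def unquiesce(self, instance_id: str) -> None:
        """Resume normal I/O for the instance."""


class RecordingDriver(QuiesceDriver):
    """Stand-in driver that only records calls; a real one would invoke
    VSS, qemu-ga fsfreeze, or a vendor mechanism."""

    def __init__(self, mechanism: str) -> None:
        self.mechanism = mechanism
        self.log: list[tuple[str, str]] = []

    def quiesce(self, instance_id: str) -> None:
        self.log.append(("quiesce", instance_id))

    def unquiesce(self, instance_id: str) -> None:
        self.log.append(("unquiesce", instance_id))


class QuiesceAPI:
    """Hypothetical generic entry point that dispatches to the right driver."""

    def __init__(self, drivers: dict[str, QuiesceDriver]) -> None:
        self._drivers = drivers

    def _driver(self, hypervisor: str) -> QuiesceDriver:
        if hypervisor not in self._drivers:
            raise LookupError(f"no quiesce driver registered for {hypervisor!r}")
        return self._drivers[hypervisor]

    def quiesce(self, instance_id: str, hypervisor: str) -> None:
        self._driver(hypervisor).quiesce(instance_id)

    def unquiesce(self, instance_id: str, hypervisor: str) -> None:
        self._driver(hypervisor).unquiesce(instance_id)
```

The point of the shape is that callers see one request regardless of whether the guest is quiesced through VSS, fsfreeze, or something else, which is also where third-party drivers could plug in.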
Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder
Bruce,

Nice list of use cases; thank you for sharing. One thought:

Bruce Montague bruce_monta...@symantec.com wrote on 13/03/2014 04:34:59 PM:

* (2) [Core tenant/project infrastructure VMs]

Twenty VMs power the core infrastructure of a group using a private cloud (OpenStack in their own datacenter). Not all VMs run Windows with VSS; some run Linux with some equivalent mechanism, such as qemu-ga, driving fsfreeze and signal scripts. These VMs are replicated to a remote OpenStack deployment, in a fashion similar to (1). Orchestration occurring at the remote site on failover is more complex (correct VM boot order is orchestrated, DHCP service is configured as expected, all IPs are made available and verified). An equivalent virtual network topology consisting of multiple networks or subnets might be pre-created or dynamically created at failover time.

a. Storage for all volumes of all VMs might be on a single storage backend (logically a single large volume containing many smaller sub-volumes, examples being a VMware datastore or Hyper-V CSV). This entire large volume might be replicated between similar storage backends at the primary and secondary site. A single replicated large volume thus replicates all the tenant VMs' volumes. The DR system must trigger quiesce of all volumes to an application-consistent state.

A variant of having logically a single volume on a single storage backend is having all the volumes allocated from storage that provides consistency groups. This may also be related to cross-VM consistent backups/snapshots. Of course, a question would be whether, and if so how, to surface this.

-- Michael

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
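The consistency-group idea raised in this message (all volumes of a set of VMs captured at one consistent point) can be sketched as a quiesce-all, snapshot-all, resume-all sequence. The function below is a minimal illustration under stated assumptions: `quiesce`, `snapshot`, and `resume` are hypothetical callables supplied by the caller, and the sketch does not model how any particular storage backend implements consistency groups.

```python
from typing import Callable, Dict, List, TypeVar

S = TypeVar("S")


def snapshot_consistency_group(
    volumes: List[str],
    quiesce: Callable[[str], None],
    snapshot: Callable[[str], S],
    resume: Callable[[str], None],
) -> Dict[str, S]:
    """Snapshot every volume while all of them are quiesced.

    No snapshot is taken until every volume has been quiesced, so the
    snapshots are mutually consistent. Volumes that were quiesced are
    always resumed, even if a later step fails.
    """
    quiesced: List[str] = []
    try:
        for volume in volumes:
            quiesce(volume)
            quiesced.append(volume)
        return {volume: snapshot(volume) for volume in volumes}
    finally:
        for volume in reversed(quiesced):
            resume(volume)
```

Whether this grouping should be surfaced to tenants, and how, is the open question the message ends on; the sketch only shows the ordering constraint.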
Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder
Funny this topic came up. I was just looking into some of this yesterday. Here's some links that I came up with: * https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Administration_Guide/sub-sect-qemu-ga-freeze-thaw.html - Describes how application level safe backups of vm's can be accomplished. Didn't have the proper framework prior to RedHat 6.5. Looks reasonable now. * http://lists.gnu.org/archive/html/qemu-devel/2012-11/msg01043.html - An example of a hook that lets you snapshot mysql safely while it is still running. * https://wiki.openstack.org/wiki/Cinder/QuiescedSnapshotWithQemuGuestAgent - A blueprint for making safe live snapshots enabled via the Cinder api. Its not there yet, but being worked on. * https://blueprints.launchpad.net/nova/+spec/qemu-guest-agent-support - Nova supports freeze/thawing the instance. Thanks, Kevin From: Bruce Montague [bruce_monta...@symantec.com] Sent: Thursday, March 13, 2014 7:34 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder Hi, about OpenStack and VSS. Does anyone have experience with the qemu project's implementation of VSS support? They appear to have a within-guest agent, qemu-ga, that perhaps can work as a VSS requestor. Does it also work with KVM? Does qemu-ga work with libvirt (can VSS quiesce be triggered via libvirt)? I think there was an effort for qemu-ga to use fsfreeze as an equivalent to VSS on Linux systems, was that done? If so, could an OpenStack API provide a generic quiesce request that would then get passed to libvirt? (Also, the XenServer VSS support seems different than qemu/KVM's, is this true? Can it also be accessed through libvirt? 
Thanks,
-bruce

-----Original Message-----
From: Alessandro Pilotti [mailto:apilo...@cloudbasesolutions.com]
Sent: Thursday, March 13, 2014 6:49 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder

Those use cases are very important requirements in enterprise scenarios, but there's an important missing piece in the current OpenStack APIs: support for application-consistent backups via Volume Shadow Copy (or other solutions) at the instance level, including differential / incremental backups.

VSS can be seamlessly added to the Nova Hyper-V driver (it's included with the free Hyper-V Server), with e.g. vSphere and XenServer supporting it as well (quiescing), and with the option for third-party vendors to add drivers for their solutions.

A generic Nova backup / restore API supporting those features is quite straightforward to design. The main question at this stage is whether the OpenStack community wants to support those use cases or not. Cinder backup/restore support [1] and volume replication [2] are surely a great starting point in this direction.

Alessandro

[1] https://review.openstack.org/#/c/69351/
[2] https://review.openstack.org/#/c/64026/

On 12/mar/2014, at 20:45, Bruce Montague bruce_monta...@symantec.com wrote:

Hi, regarding the call to create a list of disaster recovery (DR) use cases ( http://lists.openstack.org/pipermail/openstack-dev/2014-March/028859.html ), the following list sketches some speculative OpenStack DR use cases. These use cases do not reflect any specific product behavior and span a wide spectrum. This list is not a proposal; it is intended primarily to solicit additional discussion. The first basic use case, (1), is described in a bit more detail than the others; many of the others are elaborations on this basic theme.

* (1) [Single VM] A single Windows VM with 4 volumes and VSS (Microsoft's Volume Shadow Copy Service) installed runs a key application and integral database. VSS can quiesce the app, database, filesystem, and I/O on demand and can be invoked external to the guest.

a. The VM's volumes, including the boot volume, are replicated to a remote DR site (another OpenStack deployment).

b. Some form of replicated VM or VM metadata exists at the remote site. This VM/description includes the replicated volumes. Some systems might use cold migration or some form of wide-area live VM migration to establish this remote site VM/description.

c. When specified by an SLA or policy, VSS is invoked, putting the VM's volumes in an application-consistent state. This state is flushed all the way through to the remote volumes. As each remote volume reaches its application-consistent state, this is recognized in some fashion, perhaps by an in-band signal, and a snapshot of the volume is made at the remote site. Volume replication is re-enabled immediately following the snapshot. A backup is then made of the snapshot on the remote site. At the completion of this cycle, application-consistent volume snapshots and backups exist
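For the Linux/KVM side of the quiesce step in (1)c, the guest agent's freeze and thaw commands are the nearest analogue. A rough sketch follows, under loud assumptions: `guest-fsfreeze-freeze` and `guest-fsfreeze-thaw` are real qemu-ga commands, but `send` and `take_snapshot` are hypothetical callables standing in for whatever transport (libvirt's agent pass-through would be one candidate) and snapshot mechanism a deployment actually uses. No generic OpenStack quiesce API is implied. Note also that fsfreeze on its own gives filesystem-level consistency; application consistency would need hook scripts such as the MySQL one linked in this thread.

```python
import json


def agent_command(name):
    """Serialize a qemu-ga command as the JSON text the guest agent expects."""
    return json.dumps({"execute": name})


def consistent_snapshot(send, take_snapshot):
    """Freeze guest filesystems, take a snapshot, and always thaw afterwards.

    send(json_text) delivers one command to the guest agent (hypothetical).
    take_snapshot() snapshots the VM's volumes (hypothetical).
    """
    send(agent_command("guest-fsfreeze-freeze"))
    try:
        return take_snapshot()
    finally:
        # Thaw even if the snapshot fails, so the guest is never left frozen.
        send(agent_command("guest-fsfreeze-thaw"))
```

The `finally` clause is the main design point: a failed snapshot must not leave the guest's filesystems frozen.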
Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder
1. fsfreeze with VSS has been added to qemu upstream; see http://lists.gnu.org/archive/html/qemu-devel/2013-02/msg01963.html for usage.
2. libvirt allows a client to send any command to qemu-ga; see http://wiki.libvirt.org/page/Qemu_guest_agent
3. Linux fsfreeze is not equivalent to Windows fsfreeze+VSS. Linux fsfreeze offers filesystem consistency only, while Windows VSS allows agents like SQL Server to register their plugins to flush their caches to disk when a snapshot occurs.
4. My understanding is that XenServer does not support fsfreeze+VSS now, because XenServer normally does not use the block backend in qemu.

-----Original Message-----
From: Bruce Montague [mailto:bruce_monta...@symantec.com]
Sent: Thursday, March 13, 2014 10:35 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder

Hi, about OpenStack and VSS. Does anyone have experience with the qemu project's implementation of VSS support? They appear to have a within-guest agent, qemu-ga, that perhaps can work as a VSS requestor. Does it also work with KVM? Does qemu-ga work with libvirt (can VSS quiesce be triggered via libvirt)? I think there was an effort for qemu-ga to use fsfreeze as an equivalent to VSS on Linux systems; was that done? If so, could an OpenStack API provide a generic quiesce request that would then get passed to libvirt? (Also, the XenServer VSS support seems different from qemu/KVM's; is this true? Can it also be accessed through libvirt?)
Thanks,
-bruce

-----Original Message-----
From: Alessandro Pilotti [mailto:apilo...@cloudbasesolutions.com]
Sent: Thursday, March 13, 2014 6:49 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] Disaster Recovery for OpenStack - call for stakeholder

Those use cases are very important requirements in enterprise scenarios, but there's an important missing piece in the current OpenStack APIs: support for application-consistent backups via Volume Shadow Copy (or other solutions) at the instance level, including differential / incremental backups.

VSS can be seamlessly added to the Nova Hyper-V driver (it's included with the free Hyper-V Server), with e.g. vSphere and XenServer supporting it as well (quiescing), and with the option for third-party vendors to add drivers for their solutions.

A generic Nova backup / restore API supporting those features is quite straightforward to design. The main question at this stage is whether the OpenStack community wants to support those use cases or not. Cinder backup/restore support [1] and volume replication [2] are surely a great starting point in this direction.

Alessandro

[1] https://review.openstack.org/#/c/69351/
[2] https://review.openstack.org/#/c/64026/

On 12/mar/2014, at 20:45, Bruce Montague bruce_monta...@symantec.com wrote:

Hi, regarding the call to create a list of disaster recovery (DR) use cases ( http://lists.openstack.org/pipermail/openstack-dev/2014-March/028859.html ), the following list sketches some speculative OpenStack DR use cases. These use cases do not reflect any specific product behavior and span a wide spectrum. This list is not a proposal; it is intended primarily to solicit additional discussion. The first basic use case, (1), is described in a bit more detail than the others; many of the others are elaborations on this basic theme.

* (1) [Single VM] A single Windows VM with 4 volumes and VSS (Microsoft's Volume Shadow Copy Service) installed runs a key application and integral database. VSS can quiesce the app, database, filesystem, and I/O on demand and can be invoked external to the guest.

a. The VM's volumes, including the boot volume, are replicated to a remote DR site (another OpenStack deployment).

b. Some form of replicated VM or VM metadata exists at the remote site. This VM/description includes the replicated volumes. Some systems might use cold migration or some form of wide-area live VM migration to establish this remote site VM/description.

c. When specified by an SLA or policy, VSS is invoked, putting the VM's volumes in an application-consistent state. This state is flushed all the way through to the remote volumes. As each remote volume reaches its application-consistent state, this is recognized in some fashion, perhaps by an in-band signal, and a snapshot of the volume is made at the remote site. Volume replication is re-enabled immediately following the snapshot. A backup is then made of the snapshot on the remote site. At the completion of this cycle, application-consistent volume snapshots and backups exist on the remote site.

d. When a disaster or firedrill happens, the replication network connection is cut. The remote site VM pre-created or defined so as to use the replicated volumes is then booted, using the latest application-consistent state of the replicated volumes. The entire VM
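Step d boots from "the latest application-consistent state of the replicated volumes". For a VM with several volumes, one possible reading is the newest point for which every volume has a snapshot. Here is a small sketch under that assumption; the function name and the timestamp-keyed layout are purely illustrative and are not part of Cinder or Nova.

```python
def latest_common_snapshot(snapshots_by_volume):
    """Pick the newest consistency point that exists for every volume.

    snapshots_by_volume maps a volume name to the timestamps (any mutually
    comparable values) at which an application-consistent snapshot exists.
    Returns None if the volumes share no consistency point.
    """
    if not snapshots_by_volume:
        return None
    common = set.intersection(
        *(set(stamps) for stamps in snapshots_by_volume.values())
    )
    return max(common) if common else None
```

For example, if the boot volume has snapshots at times 100, 200 and 300 but the data volume only at 100 and 200, the failover would use 200 for both volumes rather than mixing 300 with 200.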
[openstack-dev] Disaster Recovery for OpenStack - call for stakeholder
Hello,

At the Hong Kong summit, there was a lot of interest around OpenStack support for Disaster Recovery, including a design summit session, an un-conference session and a break-out session. In addition, we set up a wiki for OpenStack disaster recovery; see https://wiki.openstack.org/wiki/DisasterRecovery

The first step was enabling volume replication in Cinder, which started in the Icehouse development cycle and will continue into Juno.

Toward the Juno summit and development cycle we would like to send out a call for disaster recovery stakeholders, looking to:

* Create a list of use-cases and scenarios for disaster recovery with OpenStack
* Find interested parties who wish to contribute features and code to advance disaster recovery in OpenStack
* Plan the discussions needed at the Juno summit

To coordinate such efforts, I would like to invite you to a conference call on Wednesday March 5 at 12pm ET to work together on coordinating actions for the Juno summit (an invitation is attached).

We will record minutes of the call at https://etherpad.openstack.org/p/juno-disaster-recovery-call-for-stakeholders (link also available from the disaster recovery wiki page). If you are unable to join but are interested, please register yourself and share your thoughts.

Call-in numbers are available at https://www.teleconference.att.com/servlet/glbAccess?process=1accessCode=6406941accessNumber=1809417783#C2
Passcode: 6406941

Regards,
Ronen I. Kat, PhD
Storage Research
IBM Research - Haifa
Phone: +972.3.7689493
Email: ronen...@il.ibm.com