Re: [DISCUSS] VR upgrade downtime reduction
Yes, nice work! From: Daan Hoogland Sent: Tuesday, May 1, 2018 5:28 AM To: us...@cloudstack.apache.org Cc: dev Subject: Re: [DISCUSS] VR upgrade downtime reduction good work Rohit, I'll review 2508 https://github.com/apache/cloudstack/pull/2508 On Tue, May 1, 2018 at 12:08 PM, Rohit Yadav wrote: > All, > > > A short-term solution to VR upgrade or network restart (with cleanup=true) > has been implemented: > > > - The strategy for redundant VRs builds on top of Wei's original patch > where backup routers are removed and replace in a rolling basis. The > downtime I saw was usually 0-2 seconds, and theoretically downtime is > maximum of [0, 3*advertisement interval + skew seconds] or 0-10 seconds > (with cloudstack's default of 1s advertisement interval). > > > - For non-redundant routers, I've implemented a strategy where first a new > VR is deployed, then old VR is powered-off/destroyed, and the new VR is > again re-programmed. With this strategy, two identical VRs may be up for a > brief moment (few seconds) where both can serve traffic, however the new VR > performs arp-ping on its interfaces to update neighbours. After the old VR > is removed, the new VR is re-programmed which among many things performs > another arpping. The theoretical downtime is therefore limited by the > arp-cache refresh which can be up to 30 seconds. In my experiments, against > various VMware, KVM and XenServer versions I found that the downtime was > indeed less than 30s, usually between 5-20 seconds. Compared to older ACS > versions, especially in cases where VRs deployment require full volume copy > (like in VMware) a 10x-12x improvement was seen. > > > Please review, test the following PRs which has test details, benchmarks, > and some screenshots: > > https://github.com/apache/cloudstack/pull/2508 > > > Future work can be driven towards making all VRs redundant enabled by > default that can allow for a firewall+connections state transfer > (conntrackd + VRRP2/3 based) during rolling reboots. > > > - Rohit > > <https://cloudstack.apache.org> > > > > ____ > From: Daan Hoogland > Sent: Thursday, February 8, 2018 3:11:51 PM > To: dev > Subject: Re: [DISCUSS] VR upgrade downtime reduction > > to stop the vote and continue the discussion. I personally want unification > of all router vms: VR, 'shared network', rVR, VPC, rVPC, and eventually the > one we want to create for 'enterprise topology hand-off points'. And I > think we have some level of consensus on that but the path there is a > concern for Wido and for some of my colleagues as well, and rightly so. One > issue is upgrades from older versions. > > I the common scenario as follows: > + redundancy is deprecated and only number of instances remain. > + an old VR is replicated in memory by an redundant enabled version, that > will be in a state of running but inactive. > - the old one will be destroyed while a ping is running > - as soon as the ping fails more then three times in a row (this might have > to have a hypervisor specific implementation or require a helper vm) > + the new one is activated > > after this upgrade Wei's and/or Remi's code will do the work for any > following upgrade. > > flames, please > > > > On Wed, Feb 7, 2018 at 12:17 PM, Nux! wrote: > > > +1 too > > > > -- > > Sent from the Delta quadrant using Borg technology! > > > > Nux! 
> > www.nux.ro > > > > > rohit.ya...@shapeblue.com > www.shapeblue.com<http://www.shapeblue.com> > 53 Chandos Place, Covent Garden, London WC2N 4HSUK > @shapeblue > > > > - Original Message - > > > From: "Rene Moser" > > > To: "dev" > > > Sent: Wednesday, 7 February, 2018 10:11:45 > > > Subject: Re: [DISCUSS] VR upgrade downtime reduction > > > > > On 02/06/2018 02:47 PM, Remi Bergsma wrote: > > >> Hi Daan, > > >> > > >> In my opinion the biggest issue is the fact that there are a lot of > > different > > >> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's > > why you > > >> cannot simply switch from a single VPC to a redundant VPC for example. > > >> > > >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a > > VPC with a > > >> single tier and made sure all features are supported. Next we merged > > the single > > >> and redundant VPC code paths. The idea here is that redundancy or not > > should > > >> only be a differ
Re: [DISCUSS] VR upgrade downtime reduction
good work Rohit, I'll review 2508 https://github.com/apache/cloudstack/pull/2508 On Tue, May 1, 2018 at 12:08 PM, Rohit Yadav wrote: > All, > > > A short-term solution to VR upgrade or network restart (with cleanup=true) > has been implemented: > > > - The strategy for redundant VRs builds on top of Wei's original patch > where backup routers are removed and replace in a rolling basis. The > downtime I saw was usually 0-2 seconds, and theoretically downtime is > maximum of [0, 3*advertisement interval + skew seconds] or 0-10 seconds > (with cloudstack's default of 1s advertisement interval). > > > - For non-redundant routers, I've implemented a strategy where first a new > VR is deployed, then old VR is powered-off/destroyed, and the new VR is > again re-programmed. With this strategy, two identical VRs may be up for a > brief moment (few seconds) where both can serve traffic, however the new VR > performs arp-ping on its interfaces to update neighbours. After the old VR > is removed, the new VR is re-programmed which among many things performs > another arpping. The theoretical downtime is therefore limited by the > arp-cache refresh which can be up to 30 seconds. In my experiments, against > various VMware, KVM and XenServer versions I found that the downtime was > indeed less than 30s, usually between 5-20 seconds. Compared to older ACS > versions, especially in cases where VRs deployment require full volume copy > (like in VMware) a 10x-12x improvement was seen. > > > Please review, test the following PRs which has test details, benchmarks, > and some screenshots: > > https://github.com/apache/cloudstack/pull/2508 > > > Future work can be driven towards making all VRs redundant enabled by > default that can allow for a firewall+connections state transfer > (conntrackd + VRRP2/3 based) during rolling reboots. > > > - Rohit > > <https://cloudstack.apache.org> > > > > ____ > From: Daan Hoogland > Sent: Thursday, February 8, 2018 3:11:51 PM > To: dev > Subject: Re: [DISCUSS] VR upgrade downtime reduction > > to stop the vote and continue the discussion. I personally want unification > of all router vms: VR, 'shared network', rVR, VPC, rVPC, and eventually the > one we want to create for 'enterprise topology hand-off points'. And I > think we have some level of consensus on that but the path there is a > concern for Wido and for some of my colleagues as well, and rightly so. One > issue is upgrades from older versions. > > I the common scenario as follows: > + redundancy is deprecated and only number of instances remain. > + an old VR is replicated in memory by an redundant enabled version, that > will be in a state of running but inactive. > - the old one will be destroyed while a ping is running > - as soon as the ping fails more then three times in a row (this might have > to have a hypervisor specific implementation or require a helper vm) > + the new one is activated > > after this upgrade Wei's and/or Remi's code will do the work for any > following upgrade. > > flames, please > > > > On Wed, Feb 7, 2018 at 12:17 PM, Nux! wrote: > > > +1 too > > > > -- > > Sent from the Delta quadrant using Borg technology! > > > > Nux! 
> > www.nux.ro > > > > > rohit.ya...@shapeblue.com > www.shapeblue.com > 53 Chandos Place, Covent Garden, London WC2N 4HSUK > @shapeblue > > > > - Original Message - > > > From: "Rene Moser" > > > To: "dev" > > > Sent: Wednesday, 7 February, 2018 10:11:45 > > > Subject: Re: [DISCUSS] VR upgrade downtime reduction > > > > > On 02/06/2018 02:47 PM, Remi Bergsma wrote: > > >> Hi Daan, > > >> > > >> In my opinion the biggest issue is the fact that there are a lot of > > different > > >> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's > > why you > > >> cannot simply switch from a single VPC to a redundant VPC for example. > > >> > > >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a > > VPC with a > > >> single tier and made sure all features are supported. Next we merged > > the single > > >> and redundant VPC code paths. The idea here is that redundancy or not > > should > > >> only be a difference in the number of routers. Code should be the > same. > > A > > >> single router, is also "master" but there just is no "backup". > > >> > > >> That simplifies thi
Re: [DISCUSS] VR upgrade downtime reduction
All,

A short-term solution to VR upgrade or network restart (with cleanup=true) has been implemented:

- The strategy for redundant VRs builds on top of Wei's original patch, where backup routers are removed and replaced on a rolling basis. The downtime I saw was usually 0-2 seconds, and the theoretical downtime is bounded by [0, 3*advertisement interval + skew] seconds, i.e. 0-10 seconds with CloudStack's default 1s advertisement interval (a worked example of this bound follows this message).

- For non-redundant routers, I've implemented a strategy where first a new VR is deployed, then the old VR is powered off/destroyed, and the new VR is re-programmed once more. With this strategy, two identical VRs may be up for a brief moment (a few seconds) where both can serve traffic; however, the new VR performs an arp-ping on its interfaces to update neighbours. After the old VR is removed, the new VR is re-programmed, which among other things performs another arping. The theoretical downtime is therefore limited by the ARP-cache refresh, which can take up to 30 seconds. In my experiments against various VMware, KVM and XenServer versions I found that the downtime was indeed less than 30s, usually between 5 and 20 seconds. Compared to older ACS versions, especially in cases where VR deployment requires a full volume copy (as on VMware), a 10x-12x improvement was seen.

Please review and test the following PR, which has test details, benchmarks, and some screenshots:

https://github.com/apache/cloudstack/pull/2508

Future work can be driven towards making all VRs redundant-enabled by default, which would allow a firewall and connection-state transfer (conntrackd + VRRP2/3 based) during rolling reboots.

- Rohit

<https://cloudstack.apache.org>

From: Daan Hoogland Sent: Thursday, February 8, 2018 3:11:51 PM To: dev Subject: Re: [DISCUSS] VR upgrade downtime reduction

to stop the vote and continue the discussion. I personally want unification of all router vms: VR, 'shared network', rVR, VPC, rVPC, and eventually the one we want to create for 'enterprise topology hand-off points'. And I think we have some level of consensus on that but the path there is a concern for Wido and for some of my colleagues as well, and rightly so. One issue is upgrades from older versions. I the common scenario as follows: + redundancy is deprecated and only number of instances remain. + an old VR is replicated in memory by an redundant enabled version, that will be in a state of running but inactive. - the old one will be destroyed while a ping is running - as soon as the ping fails more then three times in a row (this might have to have a hypervisor specific implementation or require a helper vm) + the new one is activated after this upgrade Wei's and/or Remi's code will do the work for any following upgrade. flames, please

On Wed, Feb 7, 2018 at 12:17 PM, Nux! wrote: > +1 too > > -- > Sent from the Delta quadrant using Borg technology! > > Nux! > www.nux.ro > > rohit.ya...@shapeblue.com www.shapeblue.com 53 Chandos Place, Covent Garden, London WC2N 4HSUK @shapeblue - Original Message - > > From: "Rene Moser" > > To: "dev" > > Sent: Wednesday, 7 February, 2018 10:11:45 > > Subject: Re: [DISCUSS] VR upgrade downtime reduction > > > On 02/06/2018 02:47 PM, Remi Bergsma wrote: > >> Hi Daan, > >> > >> In my opinion the biggest issue is the fact that there are a lot of > different > >> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's > why you > >> cannot simply switch from a single VPC to a redundant VPC for example.
> >> > >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a > VPC with a > >> single tier and made sure all features are supported. Next we merged > the single > >> and redundant VPC code paths. The idea here is that redundancy or not > should > >> only be a difference in the number of routers. Code should be the same. > A > >> single router, is also "master" but there just is no "backup". > >> > >> That simplifies things A LOT, as keepalived is now the master of the > whole > >> thing. No more assigning ip addresses in Python, but leave that to > keepalived > >> instead. Lots of code deleted. Easier to maintain, way more stable. We > just > >> released Cosmic 6 that has this feature and are now rolling it out in > >> production. Looking good so far. This change unlocks a lot of > possibilities, > >> like live upgrading from a single VPC to a redundant one (and back). In > the > >> end, if the redundant VPC is rock solid, you most likely don't even > want single > >> VPCs any more. But that will come. > >> > >> As I said, we're rolling this out as we speak. In a few weeks when > everything is > >> upgraded I can share what we learned and how well it works. CloudStack > could > >> use a similar approach. > > > > +1 Pretty much this. > > > > René > -- Daan
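For reference, the bound Rohit cites for the redundant-VR case is VRRP's master-down timer. A minimal sketch of the arithmetic, assuming VRRPv2 semantics and an arbitrary backup priority (the priority value is an assumption; the 1s advertisement interval is the CloudStack default mentioned above):

# VRRPv2 (RFC 3768) failover timing:
#   skew_time = (256 - priority) / 256                      (seconds)
#   master_down_interval = 3 * advertisement_interval + skew_time
adv_interval = 1.0   # seconds; CloudStack's default advertisement interval
priority = 100       # assumed backup-router priority (any value in 1-254)

skew_time = (256 - priority) / 256.0
master_down_interval = 3 * adv_interval + skew_time
print(f"backup declares the master dead after at most {master_down_interval:.2f} s")
# ~3.61 s with these numbers, comfortably inside the 0-10 s envelope quoted above.

The 30-second figure for the non-redundant path is simply a typical ARP-cache lifetime on neighbouring devices; the arping the new VR sends after reprogramming is what usually cuts the observed downtime down to the reported 5-20 seconds.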
Re: [DISCUSS] VR upgrade downtime reduction
to stop the vote and continue the discussion. I personally want unification of all router VMs: VR, 'shared network', rVR, VPC, rVPC, and eventually the one we want to create for 'enterprise topology hand-off points'. And I think we have some level of consensus on that, but the path there is a concern for Wido and for some of my colleagues as well, and rightly so. One issue is upgrades from older versions.

I see the common scenario as follows:
+ redundancy is deprecated and only the number of instances remains.
+ an old VR is replicated in memory by a redundancy-enabled version, which will be in a state of running but inactive.
- the old one will be destroyed while a ping is running
- as soon as the ping fails more than three times in a row (this might have to have a hypervisor-specific implementation or require a helper vm; a rough sketch of such a helper follows this message)
+ the new one is activated

after this upgrade Wei's and/or Remi's code will do the work for any following upgrade.

flames, please

On Wed, Feb 7, 2018 at 12:17 PM, Nux! wrote: > +1 too > > -- > Sent from the Delta quadrant using Borg technology! > > Nux! > www.nux.ro > > - Original Message - > > From: "Rene Moser" > > To: "dev" > > Sent: Wednesday, 7 February, 2018 10:11:45 > > Subject: Re: [DISCUSS] VR upgrade downtime reduction > > > On 02/06/2018 02:47 PM, Remi Bergsma wrote: > >> Hi Daan, > >> > >> In my opinion the biggest issue is the fact that there are a lot of > different > >> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's > why you > >> cannot simply switch from a single VPC to a redundant VPC for example. > >> > >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a > VPC with a > >> single tier and made sure all features are supported. Next we merged > the single > >> and redundant VPC code paths. The idea here is that redundancy or not > should > >> only be a difference in the number of routers. Code should be the same. > A > >> single router, is also "master" but there just is no "backup". > >> > >> That simplifies things A LOT, as keepalived is now the master of the > whole > >> thing. No more assigning ip addresses in Python, but leave that to > keepalived > >> instead. Lots of code deleted. Easier to maintain, way more stable. We > just > >> released Cosmic 6 that has this feature and are now rolling it out in > >> production. Looking good so far. This change unlocks a lot of > possibilities, > >> like live upgrading from a single VPC to a redundant one (and back). In > the > >> end, if the redundant VPC is rock solid, you most likely don't even > want single > >> VPCs any more. But that will come. > >> > >> As I said, we're rolling this out as we speak. In a few weeks when > everything is > >> upgraded I can share what we learned and how well it works. CloudStack > could > >> use a similar approach. > > > > +1 Pretty much this. > > > > René > -- Daan
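To make the cutover step in Daan's scenario concrete, here is a rough sketch of the kind of ping-watch helper he describes: the old VR is destroyed while a ping is running, and the standby is only activated once the ping has failed more than three times in a row. All names and the two orchestration hooks are hypothetical illustrations, not existing CloudStack code:

import subprocess
import time

def ping_once(ip, timeout_s=1):
    """One ICMP echo to the old VR's address, using the system ping binary."""
    rc = subprocess.call(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return rc == 0

def wait_until_old_vr_gone(ip, failures_needed=4):
    """Block until `failures_needed` consecutive pings fail (Daan's 'more than three times in a row')."""
    failures = 0
    while failures < failures_needed:
        if ping_once(ip):
            failures = 0
        else:
            failures += 1
        time.sleep(1)

# Hypothetical orchestration, standing in for hypervisor-specific calls or a helper VM:
# destroy_old_vr()                    # old VR is taken down while the ping-watch runs
# wait_until_old_vr_gone("10.1.1.1")  # placeholder guest IP of the old VR
# activate_new_vr()                   # the pre-provisioned, inactive replacement goes live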
Re: [DISCUSS] VR upgrade downtime reduction
+1 too -- Sent from the Delta quadrant using Borg technology! Nux! www.nux.ro - Original Message - > From: "Rene Moser" > To: "dev" > Sent: Wednesday, 7 February, 2018 10:11:45 > Subject: Re: [DISCUSS] VR upgrade downtime reduction > On 02/06/2018 02:47 PM, Remi Bergsma wrote: >> Hi Daan, >> >> In my opinion the biggest issue is the fact that there are a lot of different >> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's why you >> cannot simply switch from a single VPC to a redundant VPC for example. >> >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC >> with a >> single tier and made sure all features are supported. Next we merged the >> single >> and redundant VPC code paths. The idea here is that redundancy or not should >> only be a difference in the number of routers. Code should be the same. A >> single router, is also "master" but there just is no "backup". >> >> That simplifies things A LOT, as keepalived is now the master of the whole >> thing. No more assigning ip addresses in Python, but leave that to keepalived >> instead. Lots of code deleted. Easier to maintain, way more stable. We just >> released Cosmic 6 that has this feature and are now rolling it out in >> production. Looking good so far. This change unlocks a lot of possibilities, >> like live upgrading from a single VPC to a redundant one (and back). In the >> end, if the redundant VPC is rock solid, you most likely don't even want >> single >> VPCs any more. But that will come. >> >> As I said, we're rolling this out as we speak. In a few weeks when >> everything is >> upgraded I can share what we learned and how well it works. CloudStack could >> use a similar approach. > > +1 Pretty much this. > > René
Re: [DISCUSS] VR upgrade downtime reduction
On 02/06/2018 02:47 PM, Remi Bergsma wrote: > Hi Daan, > > In my opinion the biggest issue is the fact that there are a lot of different > code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's why you > cannot simply switch from a single VPC to a redundant VPC for example. > > For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC with > a single tier and made sure all features are supported. Next we merged the > single and redundant VPC code paths. The idea here is that redundancy or not > should only be a difference in the number of routers. Code should be the > same. A single router, is also "master" but there just is no "backup". > > That simplifies things A LOT, as keepalived is now the master of the whole > thing. No more assigning ip addresses in Python, but leave that to keepalived > instead. Lots of code deleted. Easier to maintain, way more stable. We just > released Cosmic 6 that has this feature and are now rolling it out in > production. Looking good so far. This change unlocks a lot of possibilities, > like live upgrading from a single VPC to a redundant one (and back). In the > end, if the redundant VPC is rock solid, you most likely don't even want > single VPCs any more. But that will come. > > As I said, we're rolling this out as we speak. In a few weeks when everything > is upgraded I can share what we learned and how well it works. CloudStack > could use a similar approach. +1 Pretty much this. René
Re: [DISCUSS] VR upgrade downtime reduction
ONE-VR approach in ACS 5.0. It is time to plan for a major release and break some things... On Wed, Feb 7, 2018 at 7:17 AM, Paul Angus wrote: > It seems sensible to me to have ONE VR, and I like the idea of that we all > VRs are 'redundant-ready', again supporting the ONE-VR approach. > > The question I have is: > > - how do we handle the transition - does it need ACS 5.0? > The API and the UI separate the VR and the VPC, so what is the most > logical presentation of the proposed solution to the users/operators. > > > Kind regards, > > Paul Angus > > paul.an...@shapeblue.com > www.shapeblue.com > 53 Chandos Place, Covent Garden, London WC2N 4HSUK > @shapeblue > > > > > -Original Message- > From: Daan Hoogland [mailto:daan.hoogl...@gmail.com] > Sent: 07 February 2018 08:58 > To: dev > Subject: Re: [DISCUSS] VR upgrade downtime reduction > > Reading all the reactions I am getting wary of all the possible solutions > that we have. > We do have a fragile VR and Remi's way seems the only one to stabilise it. > It also answers the question on which of my two tactics we should follow. > Wido's abjection may be valid but services that are not started are not > crashing and thus should not hinder him. > As for Wei's changes I think the most important one is in the PR I ported > forward to master, using his older commit. I metntioned it in > > [1] https://github.com/apache/cloudstack/pull/2435 > I am looking forward to any of your PRs as well Wei. > > Making all VRs redundant is a bit of a hack and the biggest risk in it is > making sure that only one will get started. > > There is one point I'd like consensus on; We have only one system > template and we are well served by letting it have only one form as VR. Do > we agree on that? > > comments, flames, questions, regards, > > > On Tue, Feb 6, 2018 at 9:04 PM, Wei ZHOU wrote: > > > Hi Remi, > > > > Actually in our fork, there are more changes than restartnetwork and > > restart vpc, similar as your changes. > > (1) edit networks from offering with single VR to offerings with RVR, > > will hack VR (set new guest IP, start keepalived and conntrackd, > > blablabla) > > (2) restart vpc from single VR to RVR. similar changes will be made. > > The downtime is around 5s. However, these changes are based 4.7.1, we > > are not sure if it still work in 4.11 > > > > We have lots of changes , we will port the changes to 4.11 LTS and > > create PRs in the next months. > > > > -Wei > > > > > > 2018-02-06 14:47 GMT+01:00 Remi Bergsma : > > > > > Hi Daan, > > > > > > In my opinion the biggest issue is the fact that there are a lot of > > > different code paths: VPC versus non-VPC, VPC versus redundant-VPC, > etc. > > > That's why you cannot simply switch from a single VPC to a redundant > > > VPC for example. > > > > > > For SBP, we mitigated that in Cosmic by converting all non-VPCs to a > > > VPC with a single tier and made sure all features are supported. > > > Next we > > merged > > > the single and redundant VPC code paths. The idea here is that > > > redundancy or not should only be a difference in the number of > > > routers. Code should > > be > > > the same. A single router, is also "master" but there just is no > > "backup". > > > > > > That simplifies things A LOT, as keepalived is now the master of the > > whole > > > thing. No more assigning ip addresses in Python, but leave that to > > > keepalived instead. Lots of code deleted. Easier to maintain, way > > > more stable. 
We just released Cosmic 6 that has this feature and are > > > now > > rolling > > > it out in production. Looking good so far. This change unlocks a lot > > > of possibilities, like live upgrading from a single VPC to a > > > redundant one (and back). In the end, if the redundant VPC is rock > > > solid, you most > > likely > > > don't even want single VPCs any more. But that will come. > > > > > > As I said, we're rolling this out as we speak. In a few weeks when > > > everything is upgraded I can share what we learned and how well it > works. > > > CloudStack could use a similar approach. > > > > > > Kind Regards, > > > Remi > > > > > > > > > > > > On 05/02/2018, 16:44, "Daan Hoogland" > wrote: > > > > > > H devs, > > > > >
RE: [DISCUSS] VR upgrade downtime reduction
It seems sensible to me to have ONE VR, and I like the idea of that we all VRs are 'redundant-ready', again supporting the ONE-VR approach. The question I have is: - how do we handle the transition - does it need ACS 5.0? The API and the UI separate the VR and the VPC, so what is the most logical presentation of the proposed solution to the users/operators. Kind regards, Paul Angus paul.an...@shapeblue.com www.shapeblue.com 53 Chandos Place, Covent Garden, London WC2N 4HSUK @shapeblue -Original Message- From: Daan Hoogland [mailto:daan.hoogl...@gmail.com] Sent: 07 February 2018 08:58 To: dev Subject: Re: [DISCUSS] VR upgrade downtime reduction Reading all the reactions I am getting wary of all the possible solutions that we have. We do have a fragile VR and Remi's way seems the only one to stabilise it. It also answers the question on which of my two tactics we should follow. Wido's abjection may be valid but services that are not started are not crashing and thus should not hinder him. As for Wei's changes I think the most important one is in the PR I ported forward to master, using his older commit. I metntioned it in > [1] https://github.com/apache/cloudstack/pull/2435 I am looking forward to any of your PRs as well Wei. Making all VRs redundant is a bit of a hack and the biggest risk in it is making sure that only one will get started. There is one point I'd like consensus on; We have only one system template and we are well served by letting it have only one form as VR. Do we agree on that? comments, flames, questions, regards, On Tue, Feb 6, 2018 at 9:04 PM, Wei ZHOU wrote: > Hi Remi, > > Actually in our fork, there are more changes than restartnetwork and > restart vpc, similar as your changes. > (1) edit networks from offering with single VR to offerings with RVR, > will hack VR (set new guest IP, start keepalived and conntrackd, > blablabla) > (2) restart vpc from single VR to RVR. similar changes will be made. > The downtime is around 5s. However, these changes are based 4.7.1, we > are not sure if it still work in 4.11 > > We have lots of changes , we will port the changes to 4.11 LTS and > create PRs in the next months. > > -Wei > > > 2018-02-06 14:47 GMT+01:00 Remi Bergsma : > > > Hi Daan, > > > > In my opinion the biggest issue is the fact that there are a lot of > > different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. > > That's why you cannot simply switch from a single VPC to a redundant > > VPC for example. > > > > For SBP, we mitigated that in Cosmic by converting all non-VPCs to a > > VPC with a single tier and made sure all features are supported. > > Next we > merged > > the single and redundant VPC code paths. The idea here is that > > redundancy or not should only be a difference in the number of > > routers. Code should > be > > the same. A single router, is also "master" but there just is no > "backup". > > > > That simplifies things A LOT, as keepalived is now the master of the > whole > > thing. No more assigning ip addresses in Python, but leave that to > > keepalived instead. Lots of code deleted. Easier to maintain, way > > more stable. We just released Cosmic 6 that has this feature and are > > now > rolling > > it out in production. Looking good so far. This change unlocks a lot > > of possibilities, like live upgrading from a single VPC to a > > redundant one (and back). In the end, if the redundant VPC is rock > > solid, you most > likely > > don't even want single VPCs any more. But that will come. 
> > > > As I said, we're rolling this out as we speak. In a few weeks when > > everything is upgraded I can share what we learned and how well it works. > > CloudStack could use a similar approach. > > > > Kind Regards, > > Remi > > > > > > > > On 05/02/2018, 16:44, "Daan Hoogland" wrote: > > > > H devs, > > > > I have recently (re-)submitted two PRs, one by Wei [1] and one > > by > Remi > > [2], > > that reduce downtime for redundant routers and redundant VPCs > > respectively. > > (please review those) > > Now from customers we hear that they also want to reduce downtime for > > regular VRs so as we discussed this we came to two possible > > solutions that > > we want to implement one of: > > > > 1. start and configure a new router before destroying the old > > one and then > > as a last minute action stop the old one. > > 2. make all routers start up redundancy se
Re: [DISCUSS] VR upgrade downtime reduction
Reading all the reactions, I am getting wary of all the possible solutions that we have. We do have a fragile VR, and Remi's way seems the only one to stabilise it. It also answers the question of which of my two tactics we should follow. Wido's objection may be valid, but services that are not started are not crashing and thus should not hinder him.

As for Wei's changes, I think the most important one is in the PR I ported forward to master, using his older commit. I mentioned it in

> [1] https://github.com/apache/cloudstack/pull/2435

I am looking forward to any of your PRs as well, Wei.

Making all VRs redundant is a bit of a hack, and the biggest risk in it is making sure that only one will get started.

There is one point I'd like consensus on: we have only one system template, and we are well served by letting it have only one form as VR. Do we agree on that?

comments, flames, questions, regards,

On Tue, Feb 6, 2018 at 9:04 PM, Wei ZHOU wrote: > Hi Remi, > > Actually in our fork, there are more changes than restartnetwork and > restart vpc, similar as your changes. > (1) edit networks from offering with single VR to offerings with RVR, will > hack VR (set new guest IP, start keepalived and conntrackd, blablabla) > (2) restart vpc from single VR to RVR. similar changes will be made. > The downtime is around 5s. However, these changes are based 4.7.1, we are > not sure if it still work in 4.11 > > We have lots of changes , we will port the changes to 4.11 LTS and create > PRs in the next months. > > -Wei > > > 2018-02-06 14:47 GMT+01:00 Remi Bergsma : > > > Hi Daan, > > > > In my opinion the biggest issue is the fact that there are a lot of > > different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. > > That's why you cannot simply switch from a single VPC to a redundant VPC > > for example. > > > > For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC > > with a single tier and made sure all features are supported. Next we > merged > > the single and redundant VPC code paths. The idea here is that redundancy > > or not should only be a difference in the number of routers. Code should > be > > the same. A single router, is also "master" but there just is no > "backup". > > > > That simplifies things A LOT, as keepalived is now the master of the > whole > > thing. No more assigning ip addresses in Python, but leave that to > > keepalived instead. Lots of code deleted. Easier to maintain, way more > > stable. We just released Cosmic 6 that has this feature and are now > rolling > > it out in production. Looking good so far. This change unlocks a lot of > > possibilities, like live upgrading from a single VPC to a redundant one > > (and back). In the end, if the redundant VPC is rock solid, you most > likely > > don't even want single VPCs any more. But that will come. > > > > As I said, we're rolling this out as we speak. In a few weeks when > > everything is upgraded I can share what we learned and how well it works. > > CloudStack could use a similar approach. > > > > Kind Regards, > > Remi > > > > > > > > On 05/02/2018, 16:44, "Daan Hoogland" wrote: > > > > H devs, > > > > I have recently (re-)submitted two PRs, one by Wei [1] and one by > Remi > > [2], > > that reduce downtime for redundant routers and redundant VPCs > > respectively. > > (please review those) > > Now from customers we hear that they also want to reduce downtime for > > regular VRs so as we discussed this we came to two possible solutions > > that > > we want to implement one of: > > > > 1.
start and configure a new router before destroying the old one and > > then > > as a last minute action stop the old one. > > 2. make all routers start up redundancy services but for regular > > routers > > start only one until an upgrade is required at which time a new, > second > > router can be started before killing the old one. > > > > obviously both solutions have their merits, so I want to have your > > input > > to make the broadest supported implementation. > > -1 means there will be an overlap or a small delay and interruption > of > > service. > > +1 It can be argued, "they got what they payed for". > > -2 means a overhead in memory usage by the router by the extra > services > > running on it. > > +2 the number of router-varieties will be further reduced. > > > > -1&-2 We have to deal with potentially large upgrade steps from way > > before > > the cloudstack era even and might be stuck to 1 because of that, > > needing to > > hack around it. Any dealing with older VRs, pre 4.5 and especially > pre > > 4.0 > > will be hard. > > > > I am not cross posting though this might be one of these occasions > > where it > > is appropriate to include users@. Just my puristic inhibitions. > > > > Of course I have preferences but can you share your thoughts, please? > > > > And
Re: [DISCUSS] VR upgrade downtime reduction
Hi Remi, Actually in our fork, there are more changes than restartnetwork and restart vpc, similar as your changes. (1) edit networks from offering with single VR to offerings with RVR, will hack VR (set new guest IP, start keepalived and conntrackd, blablabla) (2) restart vpc from single VR to RVR. similar changes will be made. The downtime is around 5s. However, these changes are based 4.7.1, we are not sure if it still work in 4.11 We have lots of changes , we will port the changes to 4.11 LTS and create PRs in the next months. -Wei 2018-02-06 14:47 GMT+01:00 Remi Bergsma : > Hi Daan, > > In my opinion the biggest issue is the fact that there are a lot of > different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. > That's why you cannot simply switch from a single VPC to a redundant VPC > for example. > > For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC > with a single tier and made sure all features are supported. Next we merged > the single and redundant VPC code paths. The idea here is that redundancy > or not should only be a difference in the number of routers. Code should be > the same. A single router, is also "master" but there just is no "backup". > > That simplifies things A LOT, as keepalived is now the master of the whole > thing. No more assigning ip addresses in Python, but leave that to > keepalived instead. Lots of code deleted. Easier to maintain, way more > stable. We just released Cosmic 6 that has this feature and are now rolling > it out in production. Looking good so far. This change unlocks a lot of > possibilities, like live upgrading from a single VPC to a redundant one > (and back). In the end, if the redundant VPC is rock solid, you most likely > don't even want single VPCs any more. But that will come. > > As I said, we're rolling this out as we speak. In a few weeks when > everything is upgraded I can share what we learned and how well it works. > CloudStack could use a similar approach. > > Kind Regards, > Remi > > > > On 05/02/2018, 16:44, "Daan Hoogland" wrote: > > H devs, > > I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi > [2], > that reduce downtime for redundant routers and redundant VPCs > respectively. > (please review those) > Now from customers we hear that they also want to reduce downtime for > regular VRs so as we discussed this we came to two possible solutions > that > we want to implement one of: > > 1. start and configure a new router before destroying the old one and > then > as a last minute action stop the old one. > 2. make all routers start up redundancy services but for regular > routers > start only one until an upgrade is required at which time a new, second > router can be started before killing the old one. > > obviously both solutions have their merits, so I want to have your > input > to make the broadest supported implementation. > -1 means there will be an overlap or a small delay and interruption of > service. > +1 It can be argued, "they got what they payed for". > -2 means a overhead in memory usage by the router by the extra services > running on it. > +2 the number of router-varieties will be further reduced. > > -1&-2 We have to deal with potentially large upgrade steps from way > before > the cloudstack era even and might be stuck to 1 because of that, > needing to > hack around it. Any dealing with older VRs, pre 4.5 and especially pre > 4.0 > will be hard. 
> > I am not cross posting though this might be one of these occasions > where it > is appropriate to include users@. Just my puristic inhibitions. > > Of course I have preferences but can you share your thoughts, please? > > And don't forget to review Wei's [1] and Remi's [2] work please. > > [1] https://github.com/apache/cloudstack/pull/2435 > [2] https://github.com/apache/cloudstack/pull/2436 > > -- > Daan > > >
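At the API level, the offering switch Wei describes corresponds to updateNetwork (to move the network onto an offering with a redundant router) followed by restartNetwork with cleanup=true to rebuild the router(s) under it. A minimal sketch using the third-party `cs` Python client; the endpoint, keys and UUIDs are placeholders, and whether the switch is accepted without the extra in-VR changes Wei mentions depends on the CloudStack version:

from cs import CloudStack  # third-party client: pip install cs

api = CloudStack(
    endpoint="https://cloud.example.com/client/api",  # placeholder
    key="API_KEY",                                    # placeholder
    secret="SECRET_KEY",                              # placeholder
)

network_id = "NETWORK_UUID"        # placeholder: the isolated network to convert
rvr_offering_id = "OFFERING_UUID"  # placeholder: an offering with a redundant router enabled

# Move the network to the redundant-router offering...
api.updateNetwork(id=network_id, networkofferingid=rvr_offering_id)

# ...then restart with cleanup so the VR(s) are recreated under the new offering.
api.restartNetwork(id=network_id, cleanup=True)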
Re: [DISCUSS] VR upgrade downtime reduction
looking forward to your blog(s), Remi. sound like you guys are still having fun. PS did you review your PR, i submitted for you ;) ? On Tue, Feb 6, 2018 at 2:47 PM, Remi Bergsma wrote: > Hi Daan, > > In my opinion the biggest issue is the fact that there are a lot of > different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. > That's why you cannot simply switch from a single VPC to a redundant VPC > for example. > > For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC > with a single tier and made sure all features are supported. Next we merged > the single and redundant VPC code paths. The idea here is that redundancy > or not should only be a difference in the number of routers. Code should be > the same. A single router, is also "master" but there just is no "backup". > > That simplifies things A LOT, as keepalived is now the master of the whole > thing. No more assigning ip addresses in Python, but leave that to > keepalived instead. Lots of code deleted. Easier to maintain, way more > stable. We just released Cosmic 6 that has this feature and are now rolling > it out in production. Looking good so far. This change unlocks a lot of > possibilities, like live upgrading from a single VPC to a redundant one > (and back). In the end, if the redundant VPC is rock solid, you most likely > don't even want single VPCs any more. But that will come. > > As I said, we're rolling this out as we speak. In a few weeks when > everything is upgraded I can share what we learned and how well it works. > CloudStack could use a similar approach. > > Kind Regards, > Remi > > > > On 05/02/2018, 16:44, "Daan Hoogland" wrote: > > H devs, > > I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi > [2], > that reduce downtime for redundant routers and redundant VPCs > respectively. > (please review those) > Now from customers we hear that they also want to reduce downtime for > regular VRs so as we discussed this we came to two possible solutions > that > we want to implement one of: > > 1. start and configure a new router before destroying the old one and > then > as a last minute action stop the old one. > 2. make all routers start up redundancy services but for regular > routers > start only one until an upgrade is required at which time a new, second > router can be started before killing the old one. > > obviously both solutions have their merits, so I want to have your > input > to make the broadest supported implementation. > -1 means there will be an overlap or a small delay and interruption of > service. > +1 It can be argued, "they got what they payed for". > -2 means a overhead in memory usage by the router by the extra services > running on it. > +2 the number of router-varieties will be further reduced. > > -1&-2 We have to deal with potentially large upgrade steps from way > before > the cloudstack era even and might be stuck to 1 because of that, > needing to > hack around it. Any dealing with older VRs, pre 4.5 and especially pre > 4.0 > will be hard. > > I am not cross posting though this might be one of these occasions > where it > is appropriate to include users@. Just my puristic inhibitions. > > Of course I have preferences but can you share your thoughts, please? > > And don't forget to review Wei's [1] and Remi's [2] work please. > > [1] https://github.com/apache/cloudstack/pull/2435 > [2] https://github.com/apache/cloudstack/pull/2436 > > -- > Daan > > > -- Daan
Re: [DISCUSS] VR upgrade downtime reduction
Hi Daan, In my opinion the biggest issue is the fact that there are a lot of different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's why you cannot simply switch from a single VPC to a redundant VPC for example. For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC with a single tier and made sure all features are supported. Next we merged the single and redundant VPC code paths. The idea here is that redundancy or not should only be a difference in the number of routers. Code should be the same. A single router, is also "master" but there just is no "backup". That simplifies things A LOT, as keepalived is now the master of the whole thing. No more assigning ip addresses in Python, but leave that to keepalived instead. Lots of code deleted. Easier to maintain, way more stable. We just released Cosmic 6 that has this feature and are now rolling it out in production. Looking good so far. This change unlocks a lot of possibilities, like live upgrading from a single VPC to a redundant one (and back). In the end, if the redundant VPC is rock solid, you most likely don't even want single VPCs any more. But that will come. As I said, we're rolling this out as we speak. In a few weeks when everything is upgraded I can share what we learned and how well it works. CloudStack could use a similar approach. Kind Regards, Remi On 05/02/2018, 16:44, "Daan Hoogland" wrote: H devs, I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi [2], that reduce downtime for redundant routers and redundant VPCs respectively. (please review those) Now from customers we hear that they also want to reduce downtime for regular VRs so as we discussed this we came to two possible solutions that we want to implement one of: 1. start and configure a new router before destroying the old one and then as a last minute action stop the old one. 2. make all routers start up redundancy services but for regular routers start only one until an upgrade is required at which time a new, second router can be started before killing the old one. obviously both solutions have their merits, so I want to have your input to make the broadest supported implementation. -1 means there will be an overlap or a small delay and interruption of service. +1 It can be argued, "they got what they payed for". -2 means a overhead in memory usage by the router by the extra services running on it. +2 the number of router-varieties will be further reduced. -1&-2 We have to deal with potentially large upgrade steps from way before the cloudstack era even and might be stuck to 1 because of that, needing to hack around it. Any dealing with older VRs, pre 4.5 and especially pre 4.0 will be hard. I am not cross posting though this might be one of these occasions where it is appropriate to include users@. Just my puristic inhibitions. Of course I have preferences but can you share your thoughts, please? And don't forget to review Wei's [1] and Remi's [2] work please. [1] https://github.com/apache/cloudstack/pull/2435 [2] https://github.com/apache/cloudstack/pull/2436 -- Daan
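Remi's point that a single router is effectively a redundant router without a backup is easiest to see in the keepalived configuration itself: the same vrrp_instance can be pushed to every router in the network, and with only one router there is simply never a peer to lose the election to. A small illustrative sketch of rendering such a config; the instance name, VRID, interface and addresses are made up, and this is not Cosmic's or CloudStack's actual template:

# Renders a minimal keepalived vrrp_instance; purely illustrative.
def render_keepalived_conf(virtual_ip, interface="eth0", vrid=51,
                           priority=100, adv_interval=1):
    return f"""vrrp_instance guest_network {{
    state BACKUP
    nopreempt
    interface {interface}
    virtual_router_id {vrid}
    priority {priority}
    advert_int {adv_interval}
    virtual_ipaddress {{
        {virtual_ip}
    }}
}}
"""

# The identical file goes to one router or to two; keepalived's election decides
# who holds the virtual IP, and a lone router simply promotes itself to MASTER.
print(render_keepalived_conf("10.1.1.1/24 dev eth0"))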
Re: [DISCUSS] VR upgrade downtime reduction
On 02/06/2018 12:28 PM, Daan Hoogland wrote:
> I'm afraid I don't agree on some of your comments, Wido.
>
> On Tue, Feb 6, 2018 at 12:03 PM, Wido den Hollander wrote:
>> On 02/05/2018 04:44 PM, Daan Hoogland wrote:
>>> H devs, I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi [2], that reduce downtime for redundant routers and redundant VPCs respectively. (please review those) Now from customers we hear that they also want to reduce downtime for regular VRs so as we discussed this we came to two possible solutions that we want to implement one of:
>>>
>>> 1. start and configure a new router before destroying the old one and then as a last minute action stop the old one.
>>
>> Seems like a simple solution to me, this wouldn't require a lot of changes in the VR.
>
> expect add in a stop moment just before activating, that doesn't exist yet.

Ah, yes. But it would mean additional tests and parameters. Not that it's impossible though. The VR is already fragile imho and could use a lot more love. Adding more features might break things which we currently have. That's my fear of working on them.

>>> 2. make all routers start up redundancy services but for regular routers start only one until an upgrade is required at which time a new, second router can be started before killing the old one.
>>
>> True, but that would be a problem as you would need to script a lot in the VR.
>
> all the scripts for rvr are already on the systemvm

Ah, yes, for the VPC, I forgot that.

>>> obviously both solutions have their merits, so I want to have your input to make the broadest supported implementation. -1 means there will be an overlap or a small delay and interruption of service. +1 It can be argued, "they got what they payed for". -2 means a overhead in memory usage by the router by the extra services running on it. +2 the number of router-varieties will be further reduced. -1&-2 We have to deal with potentially large upgrade steps from way before the cloudstack era even and might be stuck to 1 because of that, needing to hack around it. Any dealing with older VRs, pre 4.5 and especially pre 4.0 will be hard.
>>
>> I don't like hacking. The VRs already are 'hacky' imho.
>
> yes, it is.
>
>> We (PCextreme) are only using Basic Networking so for us the VR only does DHCP and Cloud-init, so we don't care about this that much ;)
>
> thanks for the input anyway, Wido

I think however that it's a valid point. The Redundant Virtual Router is mostly important when you have traffic flowing through it. So for Basic Networking it's less important or for a setup where traffic isn't going through the VR and it only does DHCP, am I correct?

Wido

>> Wido
>>
>>> I am not cross posting though this might be one of these occasions where it is appropriate to include users@. Just my puristic inhibitions. Of course I have preferences but can you share your thoughts, please? And don't forget to review Wei's [1] and Remi's [2] work please.
>>>
>>> [1] https://github.com/apache/cloudstack/pull/2435
>>> [2] https://github.com/apache/cloudstack/pull/2436
Re: [DISCUSS] VR upgrade downtime reduction
I'm afraid I don't agree on some of your comments, Wido. On Tue, Feb 6, 2018 at 12:03 PM, Wido den Hollander wrote: > > > On 02/05/2018 04:44 PM, Daan Hoogland wrote: > >> H devs, >> >> I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi >> [2], >> that reduce downtime for redundant routers and redundant VPCs >> respectively. >> (please review those) >> Now from customers we hear that they also want to reduce downtime for >> regular VRs so as we discussed this we came to two possible solutions that >> we want to implement one of: >> >> 1. start and configure a new router before destroying the old one and then >> as a last minute action stop the old one. >> > > Seems like a simple solution to me, this wouldn't require a lot of changes > in the VR. > expect add in a stop moment just before activating, that doesn't exist yet. > > 2. make all routers start up redundancy services but for regular routers >> start only one until an upgrade is required at which time a new, second >> router can be started before killing the old one. >> > > True, but that would be a problem as you would need to script a lot in the > VR. all the scripts for rvr are already on the systemvm > > > >> obviously both solutions have their merits, so I want to have your input >> to make the broadest supported implementation. >> -1 means there will be an overlap or a small delay and interruption of >> service. >> +1 It can be argued, "they got what they payed for". >> -2 means a overhead in memory usage by the router by the extra services >> running on it. >> +2 the number of router-varieties will be further reduced. >> >> -1&-2 We have to deal with potentially large upgrade steps from way before >> the cloudstack era even and might be stuck to 1 because of that, needing >> to >> hack around it. Any dealing with older VRs, pre 4.5 and especially pre 4.0 >> will be hard. >> >> > I don't like hacking. The VRs already are 'hacky' imho. > yes, it is. > > We (PCextreme) are only using Basic Networking so for us the VR only does > DHCP and Cloud-init, so we don't care about this that much ;) > thanks for the input anyway, Wido > > Wido > > > I am not cross posting though this might be one of these occasions where it >> is appropriate to include users@. Just my puristic inhibitions. >> >> Of course I have preferences but can you share your thoughts, please? >> >> And don't forget to review Wei's [1] and Remi's [2] work please. >> >> [1] https://github.com/apache/cloudstack/pull/2435 >> [2] https://github.com/apache/cloudstack/pull/2436 >> >> -- Daan
Re: [DISCUSS] VR upgrade downtime reduction
On 02/05/2018 04:44 PM, Daan Hoogland wrote:
> H devs, I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi [2], that reduce downtime for redundant routers and redundant VPCs respectively. (please review those) Now from customers we hear that they also want to reduce downtime for regular VRs so as we discussed this we came to two possible solutions that we want to implement one of:
>
> 1. start and configure a new router before destroying the old one and then as a last minute action stop the old one.

Seems like a simple solution to me, this wouldn't require a lot of changes in the VR.

> 2. make all routers start up redundancy services but for regular routers start only one until an upgrade is required at which time a new, second router can be started before killing the old one.

True, but that would be a problem as you would need to script a lot in the VR.

> obviously both solutions have their merits, so I want to have your input to make the broadest supported implementation. -1 means there will be an overlap or a small delay and interruption of service. +1 It can be argued, "they got what they payed for". -2 means a overhead in memory usage by the router by the extra services running on it. +2 the number of router-varieties will be further reduced. -1&-2 We have to deal with potentially large upgrade steps from way before the cloudstack era even and might be stuck to 1 because of that, needing to hack around it. Any dealing with older VRs, pre 4.5 and especially pre 4.0 will be hard.

I don't like hacking. The VRs already are 'hacky' imho.

We (PCextreme) are only using Basic Networking so for us the VR only does DHCP and Cloud-init, so we don't care about this that much ;)

Wido

> I am not cross posting though this might be one of these occasions where it is appropriate to include users@. Just my puristic inhibitions. Of course I have preferences but can you share your thoughts, please? And don't forget to review Wei's [1] and Remi's [2] work please.
>
> [1] https://github.com/apache/cloudstack/pull/2435
> [2] https://github.com/apache/cloudstack/pull/2436
[DISCUSS] VR upgrade downtime reduction
H devs,

I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi [2], that reduce downtime for redundant routers and redundant VPCs respectively (please review those). Now from customers we hear that they also want to reduce downtime for regular VRs, so as we discussed this we came to two possible solutions, of which we want to implement one:

1. start and configure a new router before destroying the old one, and then as a last-minute action stop the old one.
2. make all routers start up redundancy services, but for regular routers start only one until an upgrade is required, at which time a new, second router can be started before killing the old one.

Obviously both solutions have their merits, so I want to have your input to make the broadest supported implementation.

-1 means there will be an overlap or a small delay and interruption of service.
+1 It can be argued, "they got what they paid for".
-2 means an overhead in memory usage by the router due to the extra services running on it.
+2 The number of router varieties will be further reduced.
-1&-2 We have to deal with potentially large upgrade steps from way before the CloudStack era even, and might be stuck with 1 because of that, needing to hack around it. Any dealing with older VRs, pre 4.5 and especially pre 4.0, will be hard.

I am not cross posting, though this might be one of these occasions where it is appropriate to include users@. Just my puristic inhibitions.

Of course I have preferences, but can you share your thoughts, please?

And don't forget to review Wei's [1] and Remi's [2] work please.

[1] https://github.com/apache/cloudstack/pull/2435
[2] https://github.com/apache/cloudstack/pull/2436

-- Daan