Re: [DISCUSS] VR upgrade downtime reduction

2018-05-01 Thread Simon Weller
Yes, nice work!





From: Daan Hoogland 
Sent: Tuesday, May 1, 2018 5:28 AM
To: us...@cloudstack.apache.org
Cc: dev
Subject: Re: [DISCUSS] VR upgrade downtime reduction

Good work, Rohit,
I'll review 2508 https://github.com/apache/cloudstack/pull/2508

On Tue, May 1, 2018 at 12:08 PM, Rohit Yadav 
wrote:

> All,
>
>
> A short-term solution to VR upgrade or network restart (with cleanup=true)
> has been implemented:
>
>
> - The strategy for redundant VRs builds on top of Wei's original patch
> where backup routers are removed and replaced on a rolling basis. The
> downtime I saw was usually 0-2 seconds, and the theoretical downtime falls
> within [0, 3*advertisement interval + skew] seconds, or 0-10 seconds (with
> CloudStack's default 1s advertisement interval).
>
>
> - For non-redundant routers, I've implemented a strategy where first a new
> VR is deployed, then the old VR is powered off/destroyed, and the new VR is
> re-programmed again. With this strategy, two identical VRs may be up for a
> brief moment (a few seconds) where both can serve traffic; however, the new
> VR performs an arp-ping on its interfaces to update neighbours. After the
> old VR is removed, the new VR is re-programmed, which among other things
> performs another arp-ping. The theoretical downtime is therefore limited by
> the arp-cache refresh, which can be up to 30 seconds. In my experiments
> against various VMware, KVM and XenServer versions, I found that the downtime
> was indeed less than 30s, usually between 5 and 20 seconds. Compared to older
> ACS versions, especially in cases where VR deployment requires a full volume
> copy (as in VMware), a 10x-12x improvement was seen.
>
>
> Please review and test the following PR, which has test details, benchmarks,
> and some screenshots:
>
> https://github.com/apache/cloudstack/pull/2508
>
>
> Future work can be driven towards making all VRs redundant-enabled by
> default, which would allow for firewall and connection-state transfer
> (conntrackd + VRRP2/3 based) during rolling reboots.
>
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
>
>
> ____
> From: Daan Hoogland 
> Sent: Thursday, February 8, 2018 3:11:51 PM
> To: dev
> Subject: Re: [DISCUSS] VR upgrade downtime reduction
>
> To stop the vote and continue the discussion: I personally want unification
> of all router VMs: VR, 'shared network', rVR, VPC, rVPC, and eventually the
> one we want to create for 'enterprise topology hand-off points'. And I
> think we have some level of consensus on that but the path there is a
> concern for Wido and for some of my colleagues as well, and rightly so. One
> issue is upgrades from older versions.
>
> I see the common scenario as follows:
> + redundancy is deprecated and only the number of instances remains.
> + an old VR is replicated in memory by a redundant-enabled version, which
> will be in a state of running but inactive.
> - the old one will be destroyed while a ping is running
> - as soon as the ping fails more than three times in a row (this might need
> a hypervisor-specific implementation or require a helper VM)
> + the new one is activated
>
> after this upgrade Wei's and/or Remi's code will do the work for any
> following upgrade.
>
> flames, please
>
>
>
> On Wed, Feb 7, 2018 at 12:17 PM, Nux!  wrote:
>
> > +1 too
> >
> > --
> > Sent from the Delta quadrant using Borg technology!
> >
> > Nux!
> > www.nux.ro
> >
> >
> - Original Message -
> > > From: "Rene Moser" 
> > > To: "dev" 
> > > Sent: Wednesday, 7 February, 2018 10:11:45
> > > Subject: Re: [DISCUSS] VR upgrade downtime reduction
> >
> > > On 02/06/2018 02:47 PM, Remi Bergsma wrote:
> > >> Hi Daan,
> > >>
> > >> In my opinion the biggest issue is the fact that there are a lot of
> > different
> > >> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's
> > why you
> > >> cannot simply switch from a single VPC to a redundant VPC for example.
> > >>
> > >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a
> > VPC with a
> > >> single tier and made sure all features are supported. Next we merged
> > the single
> > >> and redundant VPC code paths. The idea here is that redundancy or not
> > should
> > >> only be a difference in the number of routers.

Re: [DISCUSS] VR upgrade downtime reduction

2018-05-01 Thread Daan Hoogland
Good work, Rohit,
I'll review 2508 https://github.com/apache/cloudstack/pull/2508

On Tue, May 1, 2018 at 12:08 PM, Rohit Yadav 
wrote:

> All,
>
>
> A short-term solution to VR upgrade or network restart (with cleanup=true)
> has been implemented:
>
>
> - The strategy for redundant VRs builds on top of Wei's original patch
> where backup routers are removed and replaced on a rolling basis. The
> downtime I saw was usually 0-2 seconds, and the theoretical downtime falls
> within [0, 3*advertisement interval + skew] seconds, or 0-10 seconds (with
> CloudStack's default 1s advertisement interval).
>
>
> - For non-redundant routers, I've implemented a strategy where first a new
> VR is deployed, then the old VR is powered off/destroyed, and the new VR is
> re-programmed again. With this strategy, two identical VRs may be up for a
> brief moment (a few seconds) where both can serve traffic; however, the new
> VR performs an arp-ping on its interfaces to update neighbours. After the
> old VR is removed, the new VR is re-programmed, which among other things
> performs another arp-ping. The theoretical downtime is therefore limited by
> the arp-cache refresh, which can be up to 30 seconds. In my experiments
> against various VMware, KVM and XenServer versions, I found that the downtime
> was indeed less than 30s, usually between 5 and 20 seconds. Compared to older
> ACS versions, especially in cases where VR deployment requires a full volume
> copy (as in VMware), a 10x-12x improvement was seen.
>
>
> Please review and test the following PR, which has test details, benchmarks,
> and some screenshots:
>
> https://github.com/apache/cloudstack/pull/2508
>
>
> Future work can be driven towards making all VRs redundant-enabled by
> default, which would allow for firewall and connection-state transfer
> (conntrackd + VRRP2/3 based) during rolling reboots.
>
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
>
>
> ____
> From: Daan Hoogland 
> Sent: Thursday, February 8, 2018 3:11:51 PM
> To: dev
> Subject: Re: [DISCUSS] VR upgrade downtime reduction
>
> To stop the vote and continue the discussion: I personally want unification
> of all router VMs: VR, 'shared network', rVR, VPC, rVPC, and eventually the
> one we want to create for 'enterprise topology hand-off points'. And I
> think we have some level of consensus on that but the path there is a
> concern for Wido and for some of my colleagues as well, and rightly so. One
> issue is upgrades from older versions.
>
> I see the common scenario as follows:
> + redundancy is deprecated and only the number of instances remains.
> + an old VR is replicated in memory by a redundant-enabled version, which
> will be in a state of running but inactive.
> - the old one will be destroyed while a ping is running
> - as soon as the ping fails more than three times in a row (this might need
> a hypervisor-specific implementation or require a helper VM)
> + the new one is activated
>
> after this upgrade Wei's and/or Remi's code will do the work for any
> following upgrade.
>
> flames, please
>
>
>
> On Wed, Feb 7, 2018 at 12:17 PM, Nux!  wrote:
>
> > +1 too
> >
> > --
> > Sent from the Delta quadrant using Borg technology!
> >
> > Nux!
> > www.nux.ro
> >
> >
> - Original Message -
> > > From: "Rene Moser" 
> > > To: "dev" 
> > > Sent: Wednesday, 7 February, 2018 10:11:45
> > > Subject: Re: [DISCUSS] VR upgrade downtime reduction
> >
> > > On 02/06/2018 02:47 PM, Remi Bergsma wrote:
> > >> Hi Daan,
> > >>
> > >> In my opinion the biggest issue is the fact that there are a lot of
> > different
> > >> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's
> > why you
> > >> cannot simply switch from a single VPC to a redundant VPC for example.
> > >>
> > >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a
> > VPC with a
> > >> single tier and made sure all features are supported. Next we merged
> > the single
> > >> and redundant VPC code paths. The idea here is that redundancy or not
> > should
> > >> only be a difference in the number of routers. Code should be the
> same.
> > A
> > >> single router, is also "master" but there just is no "backup".
> > >>
> > >> That simplifies things A LOT, as keepalived is now the master of the whole thing.

Re: [DISCUSS] VR upgrade downtime reduction

2018-05-01 Thread Rohit Yadav
All,


A short-term solution to VR upgrade or network restart (with cleanup=true) has 
been implemented:


- The strategy for redundant VRs builds on top of Wei's original patch where 
backup routers are removed and replaced on a rolling basis. The downtime I saw
was usually 0-2 seconds, and the theoretical downtime falls within
[0, 3*advertisement interval + skew] seconds, or 0-10 seconds (with
CloudStack's default 1s advertisement interval).
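For reference, the VRRPv2 timers behind that bound can be sketched as below; this is only an
illustrative Python snippet, and the default priority of 100 is an assumption rather than the
value the CloudStack VR actually uses.

    # Illustrative sketch of the failover bound quoted above (VRRPv2-style timers).
    def vrrp_max_failover_seconds(advert_interval: float, priority: int = 100) -> float:
        """Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time,
        where Skew_Time = (256 - priority) / 256 seconds."""
        skew = (256 - priority) / 256.0
        return 3 * advert_interval + skew

    # With the 1s advertisement interval mentioned above:
    print(vrrp_max_failover_seconds(1.0))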


- For non-redundant routers, I've implemented a strategy where first a new VR
is deployed, then the old VR is powered off/destroyed, and the new VR is
re-programmed again. With this strategy, two identical VRs may be up for a brief
moment (a few seconds) where both can serve traffic; however, the new VR performs
an arp-ping on its interfaces to update neighbours. After the old VR is removed,
the new VR is re-programmed, which among other things performs another arp-ping.
The theoretical downtime is therefore limited by the arp-cache refresh, which
can be up to 30 seconds. In my experiments against various VMware, KVM and
XenServer versions, I found that the downtime was indeed less than 30s, usually
between 5 and 20 seconds. Compared to older ACS versions, especially in cases
where VR deployment requires a full volume copy (as in VMware), a 10x-12x
improvement was seen.
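To illustrate the arp-ping step above (this is not the actual VR code), neighbour ARP caches
can be refreshed with iputils' arping in unsolicited mode; the interface/IP pairs below are
invented for the example.

    # Illustrative only: send gratuitous ARP so neighbours update their caches.
    import subprocess

    def gratuitous_arp(interface: str, ip_address: str, count: int = 3) -> None:
        # -U: unsolicited (gratuitous) ARP, -I: source interface, -c: packet count
        subprocess.run(
            ["arping", "-U", "-I", interface, "-c", str(count), ip_address],
            check=False,
        )

    # Hypothetical interface/IP pairs of the freshly programmed VR:
    for iface, ip in [("eth0", "10.1.1.1"), ("eth2", "192.168.100.1")]:
        gratuitous_arp(iface, ip)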


Please review and test the following PR, which has test details, benchmarks, and
some screenshots:

https://github.com/apache/cloudstack/pull/2508


Future work can be driven towards making all VRs redundant-enabled by default,
which would allow for firewall and connection-state transfer (conntrackd +
VRRP2/3 based) during rolling reboots.


- Rohit

<https://cloudstack.apache.org>




From: Daan Hoogland 
Sent: Thursday, February 8, 2018 3:11:51 PM
To: dev
Subject: Re: [DISCUSS] VR upgrade downtime reduction

To stop the vote and continue the discussion: I personally want unification
of all router VMs: VR, 'shared network', rVR, VPC, rVPC, and eventually the
one we want to create for 'enterprise topology hand-off points'. And I
think we have some level of consensus on that but the path there is a
concern for Wido and for some of my colleagues as well, and rightly so. One
issue is upgrades from older versions.

I see the common scenario as follows:
+ redundancy is deprecated and only the number of instances remains.
+ an old VR is replicated in memory by a redundant-enabled version, which
will be in a state of running but inactive.
- the old one will be destroyed while a ping is running
- as soon as the ping fails more than three times in a row (this might need
a hypervisor-specific implementation or require a helper VM)
+ the new one is activated

after this upgrade Wei's and/or Remi's code will do the work for any
following upgrade.

flames, please



On Wed, Feb 7, 2018 at 12:17 PM, Nux!  wrote:

> +1 too
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> 

- Original Message -
> > From: "Rene Moser" 
> > To: "dev" 
> > Sent: Wednesday, 7 February, 2018 10:11:45
> > Subject: Re: [DISCUSS] VR upgrade downtime reduction
>
> > On 02/06/2018 02:47 PM, Remi Bergsma wrote:
> >> Hi Daan,
> >>
> >> In my opinion the biggest issue is the fact that there are a lot of
> different
> >> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's
> why you
> >> cannot simply switch from a single VPC to a redundant VPC for example.
> >>
> >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a
> VPC with a
> >> single tier and made sure all features are supported. Next we merged
> the single
> >> and redundant VPC code paths. The idea here is that redundancy or not
> should
> >> only be a difference in the number of routers. Code should be the same.
> A
> >> single router, is also "master" but there just is no "backup".
> >>
> >> That simplifies things A LOT, as keepalived is now the master of the
> whole
> >> thing. No more assigning ip addresses in Python, but leave that to
> keepalived
> >> instead. Lots of code deleted. Easier to maintain, way more stable. We
> just
> >> released Cosmic 6 that has this feature and are now rolling it out in
> >> production. Looking good so far. This change unlocks a lot of
> possibilities,
> >> like live upgrading from a single VPC to a redundant one (and back). In
> the
> >> end, if the redundant VPC is rock solid, you most likely don't even
> want single
> >> VPCs any more. But that will come.
> >>
> >> As I said, we're rolling this out as we speak. In a few weeks when
> everything is
> >> upgraded I can share what we learned and how well it works. CloudStack
> could
> >> use a similar approach.
> >
> > +1 Pretty much this.
> >
> > René
>



--
Daan


Re: [DISCUSS] VR upgrade downtime reduction

2018-02-08 Thread Daan Hoogland
To stop the vote and continue the discussion: I personally want unification
of all router VMs: VR, 'shared network', rVR, VPC, rVPC, and eventually the
one we want to create for 'enterprise topology hand-off points'. And I
think we have some level of consensus on that but the path there is a
concern for Wido and for some of my colleagues as well, and rightly so. One
issue is upgrades from older versions.

I see the common scenario as follows:
+ redundancy is deprecated and only the number of instances remains.
+ an old VR is replicated in memory by a redundant-enabled version, which
will be in a state of running but inactive.
- the old one will be destroyed while a ping is running
- as soon as the ping fails more than three times in a row (this might need
a hypervisor-specific implementation or require a helper VM; a sketch of such
a check follows after this list)
+ the new one is activated
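A minimal sketch of the ping-based cut-over check mentioned in the list above; every name
here is hypothetical helper code, not something that exists in CloudStack today.

    # Hypothetical: keep pinging the old VR and return once N consecutive pings
    # fail, at which point the new VR could be activated.
    import subprocess
    import time

    def wait_for_consecutive_ping_failures(ip: str, failures_needed: int = 3,
                                           interval: float = 1.0) -> None:
        consecutive = 0
        while consecutive < failures_needed:
            result = subprocess.run(
                ["ping", "-c", "1", "-W", "1", ip],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            consecutive = consecutive + 1 if result.returncode != 0 else 0
            time.sleep(interval)

    wait_for_consecutive_ping_failures("10.1.1.1")  # IP is illustrative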

after this upgrade Wei's and/or Remi's code will do the work for any
following upgrade.

flames, please



On Wed, Feb 7, 2018 at 12:17 PM, Nux!  wrote:

> +1 too
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> - Original Message -
> > From: "Rene Moser" 
> > To: "dev" 
> > Sent: Wednesday, 7 February, 2018 10:11:45
> > Subject: Re: [DISCUSS] VR upgrade downtime reduction
>
> > On 02/06/2018 02:47 PM, Remi Bergsma wrote:
> >> Hi Daan,
> >>
> >> In my opinion the biggest issue is the fact that there are a lot of
> different
> >> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's
> why you
> >> cannot simply switch from a single VPC to a redundant VPC for example.
> >>
> >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a
> VPC with a
> >> single tier and made sure all features are supported. Next we merged
> the single
> >> and redundant VPC code paths. The idea here is that redundancy or not
> should
> >> only be a difference in the number of routers. Code should be the same.
> A
> >> single router, is also "master" but there just is no "backup".
> >>
> >> That simplifies things A LOT, as keepalived is now the master of the
> whole
> >> thing. No more assigning ip addresses in Python, but leave that to
> keepalived
> >> instead. Lots of code deleted. Easier to maintain, way more stable. We
> just
> >> released Cosmic 6 that has this feature and are now rolling it out in
> >> production. Looking good so far. This change unlocks a lot of
> possibilities,
> >> like live upgrading from a single VPC to a redundant one (and back). In
> the
> >> end, if the redundant VPC is rock solid, you most likely don't even
> want single
> >> VPCs any more. But that will come.
> >>
> >> As I said, we're rolling this out as we speak. In a few weeks when
> everything is
> >> upgraded I can share what we learned and how well it works. CloudStack
> could
> >> use a similar approach.
> >
> > +1 Pretty much this.
> >
> > René
>



-- 
Daan


Re: [DISCUSS] VR upgrade downtime reduction

2018-02-07 Thread Nux!
+1 too

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

- Original Message -
> From: "Rene Moser" 
> To: "dev" 
> Sent: Wednesday, 7 February, 2018 10:11:45
> Subject: Re: [DISCUSS] VR upgrade downtime reduction

> On 02/06/2018 02:47 PM, Remi Bergsma wrote:
>> Hi Daan,
>> 
>> In my opinion the biggest issue is the fact that there are a lot of different
>> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's why you
>> cannot simply switch from a single VPC to a redundant VPC for example.
>> 
>> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC 
>> with a
>> single tier and made sure all features are supported. Next we merged the 
>> single
>> and redundant VPC code paths. The idea here is that redundancy or not should
>> only be a difference in the number of routers. Code should be the same. A
>> single router, is also "master" but there just is no "backup".
>> 
>> That simplifies things A LOT, as keepalived is now the master of the whole
>> thing. No more assigning ip addresses in Python, but leave that to keepalived
>> instead. Lots of code deleted. Easier to maintain, way more stable. We just
>> released Cosmic 6 that has this feature and are now rolling it out in
>> production. Looking good so far. This change unlocks a lot of possibilities,
>> like live upgrading from a single VPC to a redundant one (and back). In the
>> end, if the redundant VPC is rock solid, you most likely don't even want 
>> single
>> VPCs any more. But that will come.
>> 
>> As I said, we're rolling this out as we speak. In a few weeks when 
>> everything is
>> upgraded I can share what we learned and how well it works. CloudStack could
>> use a similar approach.
> 
> +1 Pretty much this.
> 
> René


Re: [DISCUSS] VR upgrade downtime reduction

2018-02-07 Thread Rene Moser
On 02/06/2018 02:47 PM, Remi Bergsma wrote:
> Hi Daan,
> 
> In my opinion the biggest issue is the fact that there are a lot of different 
> code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's why you 
> cannot simply switch from a single VPC to a redundant VPC for example. 
> 
> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC with 
> a single tier and made sure all features are supported. Next we merged the 
> single and redundant VPC code paths. The idea here is that redundancy or not 
> should only be a difference in the number of routers. Code should be the 
> same. A single router, is also "master" but there just is no "backup".
> 
> That simplifies things A LOT, as keepalived is now the master of the whole 
> thing. No more assigning ip addresses in Python, but leave that to keepalived 
> instead. Lots of code deleted. Easier to maintain, way more stable. We just 
> released Cosmic 6 that has this feature and are now rolling it out in 
> production. Looking good so far. This change unlocks a lot of possibilities, 
> like live upgrading from a single VPC to a redundant one (and back). In the 
> end, if the redundant VPC is rock solid, you most likely don't even want 
> single VPCs any more. But that will come.
> 
> As I said, we're rolling this out as we speak. In a few weeks when everything 
> is upgraded I can share what we learned and how well it works. CloudStack 
> could use a similar approach.

+1 Pretty much this.

René


Re: [DISCUSS] VR upgrade downtime reduction

2018-02-07 Thread Rafael Weingärtner
 ONE-VR approach in ACS 5.0. It is time to plan for a major release and
break some things...

On Wed, Feb 7, 2018 at 7:17 AM, Paul Angus  wrote:

> It seems sensible to me to have ONE VR, and I like the idea that all VRs
> are 'redundant-ready', again supporting the ONE-VR approach.
>
> The question I have is:
>
> - how do we handle the transition - does it need ACS 5.0?
> The API and the UI separate the VR and the VPC, so what is the most
> logical presentation of the proposed solution to the users/operators?
>
>
> Kind regards,
>
> Paul Angus
>
>
> -Original Message-
> From: Daan Hoogland [mailto:daan.hoogl...@gmail.com]
> Sent: 07 February 2018 08:58
> To: dev 
> Subject: Re: [DISCUSS] VR upgrade downtime reduction
>
> Reading all the reactions I am getting wary of all the possible solutions
> that we have.
>  We do have a fragile VR and Remi's way seems the only one to stabilise it.
> It also answers the question on which of my two tactics we should follow.
>  Wido's objection may be valid but services that are not started are not
> crashing and thus should not hinder him.
>  As for Wei's changes I think the most important one is in the PR I ported
> forward to master, using his older commit. I mentioned it in
> > [1] https://github.com/apache/cloudstack/pull/2435
> I am looking forward to any of your PRs as well Wei.
>
>  Making all VRs redundant is a bit of a hack and the biggest risk in it is
> making sure that only one will get started.
>
>  There is one point I'd like consensus on: we have only one system
> template and we are well served by letting it have only one form as VR. Do
> we agree on that?
>
> ​comments, flames, questions, ​regards,​
>
>
> On Tue, Feb 6, 2018 at 9:04 PM, Wei ZHOU  wrote:
>
> > Hi Remi,
> >
> > Actually in our fork, there are more changes than restartnetwork and
> > restart vpc, similar to your changes.
> > (1) editing networks from an offering with a single VR to an offering with
> > RVR will hack the VR (set a new guest IP, start keepalived and conntrackd,
> > blablabla)
> > (2) restarting a VPC from a single VR to RVR; similar changes will be made.
> > The downtime is around 5s. However, these changes are based on 4.7.1, and we
> > are not sure if they still work in 4.11.
> >
> > We have lots of changes; we will port the changes to 4.11 LTS and
> > create PRs in the coming months.
> >
> > -Wei
> >
> >
> > 2018-02-06 14:47 GMT+01:00 Remi Bergsma :
> >
> > > Hi Daan,
> > >
> > > In my opinion the biggest issue is the fact that there are a lot of
> > > different code paths: VPC versus non-VPC, VPC versus redundant-VPC,
> etc.
> > > That's why you cannot simply switch from a single VPC to a redundant
> > > VPC for example.
> > >
> > > For SBP, we mitigated that in Cosmic by converting all non-VPCs to a
> > > VPC with a single tier and made sure all features are supported.
> > > Next we
> > merged
> > > the single and redundant VPC code paths. The idea here is that
> > > redundancy or not should only be a difference in the number of
> > > routers. Code should
> > be
> > > the same. A single router, is also "master" but there just is no
> > "backup".
> > >
> > > That simplifies things A LOT, as keepalived is now the master of the
> > whole
> > > thing. No more assigning ip addresses in Python, but leave that to
> > > keepalived instead. Lots of code deleted. Easier to maintain, way
> > > more stable. We just released Cosmic 6 that has this feature and are
> > > now
> > rolling
> > > it out in production. Looking good so far. This change unlocks a lot
> > > of possibilities, like live upgrading from a single VPC to a
> > > redundant one (and back). In the end, if the redundant VPC is rock
> > > solid, you most
> > likely
> > > don't even want single VPCs any more. But that will come.
> > >
> > > As I said, we're rolling this out as we speak. In a few weeks when
> > > everything is upgraded I can share what we learned and how well it
> works.
> > > CloudStack could use a similar approach.
> > >
> > > Kind Regards,
> > > Remi
> > >
> > >
> > >
> > > On 05/02/2018, 16:44, "Daan Hoogland" 
> wrote:
> > >
> > > Hi devs,
> > >
> >

RE: [DISCUSS] VR upgrade downtime reduction

2018-02-07 Thread Paul Angus
It seems sensible to me to have ONE VR, and I like the idea that all VRs
are 'redundant-ready', again supporting the ONE-VR approach.

The question I have is:

- how do we handle the transition - does it need ACS 5.0?
The API and the UI separate the VR and the VPC, so what is the most logical 
presentation of the proposed solution to the users/operators?


Kind regards,

Paul Angus



-Original Message-
From: Daan Hoogland [mailto:daan.hoogl...@gmail.com] 
Sent: 07 February 2018 08:58
To: dev 
Subject: Re: [DISCUSS] VR upgrade downtime reduction

Reading all the reactions I am getting wary of all the possible solutions that 
we have.
 We do have a fragile VR and Remi's way seems the only one to stabilise it.
It also answers the question on which of my two tactics we should follow.
 Wido's objection may be valid but services that are not started are not
crashing and thus should not hinder him.
 As for Wei's changes I think the most important one is in the PR I ported
forward to master, using his older commit. I mentioned it in
> [1] https://github.com/apache/cloudstack/pull/2435
I am looking forward to any of your PRs as well Wei.

 Making all VRs redundant is a bit of a hack and the biggest risk in it is
making sure that only one will get started.

 There is one point I'd like consensus on: we have only one system template
and we are well served by letting it have only one form as VR. Do we agree on
that?

​comments, flames, questions, ​regards,​


On Tue, Feb 6, 2018 at 9:04 PM, Wei ZHOU  wrote:

> Hi Remi,
>
> Actually in our fork, there are more changes than restartnetwork and
> restart vpc, similar to your changes.
> (1) editing networks from an offering with a single VR to an offering with
> RVR will hack the VR (set a new guest IP, start keepalived and conntrackd,
> blablabla)
> (2) restarting a VPC from a single VR to RVR; similar changes will be made.
> The downtime is around 5s. However, these changes are based on 4.7.1, and we
> are not sure if they still work in 4.11.
>
> We have lots of changes; we will port the changes to 4.11 LTS and
> create PRs in the coming months.
>
> -Wei
>
>
> 2018-02-06 14:47 GMT+01:00 Remi Bergsma :
>
> > Hi Daan,
> >
> > In my opinion the biggest issue is the fact that there are a lot of 
> > different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc.
> > That's why you cannot simply switch from a single VPC to a redundant 
> > VPC for example.
> >
> > For SBP, we mitigated that in Cosmic by converting all non-VPCs to a 
> > VPC with a single tier and made sure all features are supported. 
> > Next we
> merged
> > the single and redundant VPC code paths. The idea here is that 
> > redundancy or not should only be a difference in the number of 
> > routers. Code should
> be
> > the same. A single router, is also "master" but there just is no
> "backup".
> >
> > That simplifies things A LOT, as keepalived is now the master of the
> whole
> > thing. No more assigning ip addresses in Python, but leave that to 
> > keepalived instead. Lots of code deleted. Easier to maintain, way 
> > more stable. We just released Cosmic 6 that has this feature and are 
> > now
> rolling
> > it out in production. Looking good so far. This change unlocks a lot 
> > of possibilities, like live upgrading from a single VPC to a 
> > redundant one (and back). In the end, if the redundant VPC is rock 
> > solid, you most
> likely
> > don't even want single VPCs any more. But that will come.
> >
> > As I said, we're rolling this out as we speak. In a few weeks when 
> > everything is upgraded I can share what we learned and how well it works.
> > CloudStack could use a similar approach.
> >
> > Kind Regards,
> > Remi
> >
> >
> >
> > On 05/02/2018, 16:44, "Daan Hoogland"  wrote:
> >
> > Hi devs,
> >
> > I have recently (re-)submitted two PRs, one by Wei [1] and one 
> > by
> Remi
> > [2],
> > that reduce downtime for redundant routers and redundant VPCs 
> > respectively.
> > (please review those)
> > Now from customers we hear that they also want to reduce downtime for
> > regular VRs so as we discussed this we came to two possible 
> > solutions that
> > we want to implement one of:
> >
> > 1. start and configure a new router before destroying the old 
> > one and then
> > as a last minute action stop the old one.
> > 2. make all routers start up redundancy services but for regular routers
> > start only one until an upgrade is required at which time a new, second
> > router can be started before killing the old one.

Re: [DISCUSS] VR upgrade downtime reduction

2018-02-07 Thread Daan Hoogland
Reading all the reactions I am getting wary of all the possible solutions
that we have.
 We do have a fragile VR and Remi's way seems the only one to stabilise it.
It also answers the question on which of my two tactics we should follow.
 Wido's objection may be valid but services that are not started are not
crashing and thus should not hinder him.
 As for Wei's changes I think the most important one is in the PR I ported
forward to master, using his older commit. I mentioned it in
> [1] https://github.com/apache/cloudstack/pull/2435
I am looking forward to any of your PRs as well Wei.

 Making all VRs redundant is a bit of a hack and the biggest risk in it is
making sure that only one will get started.

 There is one point I'd like consensus on: we have only one system
template and we are well served by letting it have only one form as VR. Do
we agree on that?

​comments, flames, questions, ​regards,​


On Tue, Feb 6, 2018 at 9:04 PM, Wei ZHOU  wrote:

> Hi Remi,
>
> Actually in our fork, there are more changes than restartnetwork and
> restart vpc, similar to your changes.
> (1) editing networks from an offering with a single VR to an offering with RVR
> will hack the VR (set a new guest IP, start keepalived and conntrackd, blablabla)
> (2) restarting a VPC from a single VR to RVR; similar changes will be made.
> The downtime is around 5s. However, these changes are based on 4.7.1, and we are
> not sure if they still work in 4.11.
>
> We have lots of changes; we will port the changes to 4.11 LTS and create
> PRs in the coming months.
>
> -Wei
>
>
> 2018-02-06 14:47 GMT+01:00 Remi Bergsma :
>
> > Hi Daan,
> >
> > In my opinion the biggest issue is the fact that there are a lot of
> > different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc.
> > That's why you cannot simply switch from a single VPC to a redundant VPC
> > for example.
> >
> > For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC
> > with a single tier and made sure all features are supported. Next we
> merged
> > the single and redundant VPC code paths. The idea here is that redundancy
> > or not should only be a difference in the number of routers. Code should
> be
> > the same. A single router, is also "master" but there just is no
> "backup".
> >
> > That simplifies things A LOT, as keepalived is now the master of the
> whole
> > thing. No more assigning ip addresses in Python, but leave that to
> > keepalived instead. Lots of code deleted. Easier to maintain, way more
> > stable. We just released Cosmic 6 that has this feature and are now
> rolling
> > it out in production. Looking good so far. This change unlocks a lot of
> > possibilities, like live upgrading from a single VPC to a redundant one
> > (and back). In the end, if the redundant VPC is rock solid, you most
> likely
> > don't even want single VPCs any more. But that will come.
> >
> > As I said, we're rolling this out as we speak. In a few weeks when
> > everything is upgraded I can share what we learned and how well it works.
> > CloudStack could use a similar approach.
> >
> > Kind Regards,
> > Remi
> >
> >
> >
> > On 05/02/2018, 16:44, "Daan Hoogland"  wrote:
> >
> > Hi devs,
> >
> > I have recently (re-)submitted two PRs, one by Wei [1] and one by
> Remi
> > [2],
> > that reduce downtime for redundant routers and redundant VPCs
> > respectively.
> > (please review those)
> > Now from customers we hear that they also want to reduce downtime for
> > regular VRs so as we discussed this we came to two possible solutions
> > that
> > we want to implement one of:
> >
> > 1. start and configure a new router before destroying the old one and
> > then
> > as a last minute action stop the old one.
> > 2. make all routers start up redundancy services but for regular
> > routers
> > start only one until an upgrade is required at which time a new,
> second
> > router can be started before killing the old one.​
> >
> > ​obviously both solutions have their merits, so I want to have your
> > input
> > to make the broadest supported implementation.
> > -1 means there will be an overlap or a small delay and interruption
> of
> > service.
> > +1 It can be argued, "they got what they paid for".
> > -2 means an overhead in memory usage by the router by the extra
> services
> > running on it.
> > +2 the number of router-varieties will be further reduced.
> >
> > -1&-2 We have to deal with potentially large upgrade steps from way
> > before
> > the cloudstack era even and might be stuck to 1 because of that,
> > needing to
> > hack around it. Any dealing with older VRs, pre 4.5 and especially
> pre
> > 4.0
> > will be hard.
> >
> > I am not cross posting though this might be one of these occasions
> > where it
> > is appropriate to include users@. Just my puristic inhibitions.
> >
> > Of course I have preferences but can you share your thoughts, please?
> > ​
> And don't forget to review Wei's [1] and Remi's [2] work please.

Re: [DISCUSS] VR upgrade downtime reduction

2018-02-06 Thread Wei ZHOU
Hi Remi,

Actually in our fork, there are more changes than restartnetwork and
restart vpc, similar to your changes.
(1) editing networks from an offering with a single VR to an offering with RVR
will hack the VR (set a new guest IP, start keepalived and conntrackd,
blablabla; a minimal sketch of bringing up those services follows below)
(2) restarting a VPC from a single VR to RVR; similar changes will be made.
The downtime is around 5s. However, these changes are based on 4.7.1, and we are
not sure if they still work in 4.11.
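A purely illustrative sketch of bringing those redundancy services up on an existing VR,
assuming a systemd-based system VM template with the keepalived/conntrackd configuration
already in place (this is not Wei's actual code):

    # Illustrative only: start the redundancy services on an already-running VR.
    import subprocess

    for service in ("keepalived", "conntrackd"):
        # Assumes the system VM ships systemd units for these services.
        subprocess.run(["systemctl", "enable", "--now", service], check=False)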

We have lots of changes; we will port the changes to 4.11 LTS and create
PRs in the coming months.

-Wei


2018-02-06 14:47 GMT+01:00 Remi Bergsma :

> Hi Daan,
>
> In my opinion the biggest issue is the fact that there are a lot of
> different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc.
> That's why you cannot simply switch from a single VPC to a redundant VPC
> for example.
>
> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC
> with a single tier and made sure all features are supported. Next we merged
> the single and redundant VPC code paths. The idea here is that redundancy
> or not should only be a difference in the number of routers. Code should be
> the same. A single router, is also "master" but there just is no "backup".
>
> That simplifies things A LOT, as keepalived is now the master of the whole
> thing. No more assigning ip addresses in Python, but leave that to
> keepalived instead. Lots of code deleted. Easier to maintain, way more
> stable. We just released Cosmic 6 that has this feature and are now rolling
> it out in production. Looking good so far. This change unlocks a lot of
> possibilities, like live upgrading from a single VPC to a redundant one
> (and back). In the end, if the redundant VPC is rock solid, you most likely
> don't even want single VPCs any more. But that will come.
>
> As I said, we're rolling this out as we speak. In a few weeks when
> everything is upgraded I can share what we learned and how well it works.
> CloudStack could use a similar approach.
>
> Kind Regards,
> Remi
>
>
>
> On 05/02/2018, 16:44, "Daan Hoogland"  wrote:
>
> Hi devs,
>
> I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi
> [2],
> that reduce downtime for redundant routers and redundant VPCs
> respectively.
> (please review those)
> Now from customers we hear that they also want to reduce downtime for
> regular VRs so as we discussed this we came to two possible solutions
> that
> we want to implement one of:
>
> 1. start and configure a new router before destroying the old one and
> then
> as a last minute action stop the old one.
> 2. make all routers start up redundancy services but for regular
> routers
> start only one until an upgrade is required at which time a new, second
> router can be started before killing the old one.​
>
> ​obviously both solutions have their merits, so I want to have your
> input
> to make the broadest supported implementation.
> -1 means there will be an overlap or a small delay and interruption of
> service.
> +1 It can be argued, "they got what they paid for".
> -2 means an overhead in memory usage by the router by the extra services
> running on it.
> +2 the number of router-varieties will be further reduced.
>
> -1&-2 We have to deal with potentially large upgrade steps from way
> before
> the cloudstack era even and might be stuck to 1 because of that,
> needing to
> hack around it. Any dealing with older VRs, pre 4.5 and especially pre
> 4.0
> will be hard.
>
> I am not cross posting though this might be one of these occasions
> where it
> is appropriate to include users@. Just my puristic inhibitions.
>
> Of course I have preferences but can you share your thoughts, please?
> ​
> ​And don't forget to review Wei's [1] and Remi's [2] work please.
>
> ​[1] https://github.com/apache/cloudstack/pull/2435​
> [2] https://github.com/apache/cloudstack/pull/2436
>
> --
> Daan
>
>
>


Re: [DISCUSS] VR upgrade downtime reduction

2018-02-06 Thread Daan Hoogland
Looking forward to your blog(s), Remi. Sounds like you guys are still having
fun.

PS: did you review the PR I submitted for you ;) ?

On Tue, Feb 6, 2018 at 2:47 PM, Remi Bergsma 
wrote:

> Hi Daan,
>
> In my opinion the biggest issue is the fact that there are a lot of
> different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc.
> That's why you cannot simply switch from a single VPC to a redundant VPC
> for example.
>
> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC
> with a single tier and made sure all features are supported. Next we merged
> the single and redundant VPC code paths. The idea here is that redundancy
> or not should only be a difference in the number of routers. Code should be
> the same. A single router, is also "master" but there just is no "backup".
>
> That simplifies things A LOT, as keepalived is now the master of the whole
> thing. No more assigning ip addresses in Python, but leave that to
> keepalived instead. Lots of code deleted. Easier to maintain, way more
> stable. We just released Cosmic 6 that has this feature and are now rolling
> it out in production. Looking good so far. This change unlocks a lot of
> possibilities, like live upgrading from a single VPC to a redundant one
> (and back). In the end, if the redundant VPC is rock solid, you most likely
> don't even want single VPCs any more. But that will come.
>
> As I said, we're rolling this out as we speak. In a few weeks when
> everything is upgraded I can share what we learned and how well it works.
> CloudStack could use a similar approach.
>
> Kind Regards,
> Remi
>
>
>
> On 05/02/2018, 16:44, "Daan Hoogland"  wrote:
>
> Hi devs,
>
> I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi
> [2],
> that reduce downtime for redundant routers and redundant VPCs
> respectively.
> (please review those)
> Now from customers we hear that they also want to reduce downtime for
> regular VRs so as we discussed this we came to two possible solutions
> that
> we want to implement one of:
>
> 1. start and configure a new router before destroying the old one and
> then
> as a last minute action stop the old one.
> 2. make all routers start up redundancy services but for regular
> routers
> start only one until an upgrade is required at which time a new, second
> router can be started before killing the old one.​
>
> ​obviously both solutions have their merits, so I want to have your
> input
> to make the broadest supported implementation.
> -1 means there will be an overlap or a small delay and interruption of
> service.
> +1 It can be argued, "they got what they paid for".
> -2 means an overhead in memory usage by the router by the extra services
> running on it.
> +2 the number of router-varieties will be further reduced.
>
> -1&-2 We have to deal with potentially large upgrade steps from way
> before
> the cloudstack era even and might be stuck to 1 because of that,
> needing to
> hack around it. Any dealing with older VRs, pre 4.5 and especially pre
> 4.0
> will be hard.
>
> I am not cross posting though this might be one of these occasions
> where it
> is appropriate to include users@. Just my puristic inhibitions.
>
> Of course I have preferences but can you share your thoughts, please?
> ​
> ​And don't forget to review Wei's [1] and Remi's [2] work please.
>
> ​[1] https://github.com/apache/cloudstack/pull/2435​
> [2] https://github.com/apache/cloudstack/pull/2436
>
> --
> Daan
>
>
>


-- 
Daan


Re: [DISCUSS] VR upgrade downtime reduction

2018-02-06 Thread Remi Bergsma
Hi Daan,

In my opinion the biggest issue is the fact that there are a lot of different 
code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc. That's why you 
cannot simply switch from a single VPC to a redundant VPC for example. 

For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC with a 
single tier and made sure all features are supported. Next we merged the single 
and redundant VPC code paths. The idea here is that redundancy or not should 
only be a difference in the number of routers. Code should be the same. A 
single router, is also "master" but there just is no "backup".

That simplifies things A LOT, as keepalived is now the master of the whole 
thing. No more assigning ip addresses in Python, but leave that to keepalived 
instead. Lots of code deleted. Easier to maintain, way more stable. We just 
released Cosmic 6 that has this feature and are now rolling it out in 
production. Looking good so far. This change unlocks a lot of possibilities, 
like live upgrading from a single VPC to a redundant one (and back). In the 
end, if the redundant VPC is rock solid, you most likely don't even want single 
VPCs any more. But that will come.
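As an illustration of leaving the address to keepalived, a VRRP stanza of roughly this shape
would own the guest IP; the instance name, interface, VRID, priority and addresses below are
hypothetical and not taken from Cosmic or CloudStack.

    # Hypothetical example of a keepalived VRRP stanza owning the guest IP.
    KEEPALIVED_CONF = """
    vrrp_instance guest_net {
        state BACKUP              # both peers start as BACKUP; election picks MASTER
        interface eth2
        virtual_router_id 51
        priority 100
        advert_int 1              # 1s advertisement interval
        nopreempt
        virtual_ipaddress {
            10.1.1.1/24 dev eth2
        }
    }
    """

    def write_keepalived_conf(path: str = "/etc/keepalived/keepalived.conf") -> None:
        # Illustrative helper: write the stanza; keepalived then manages the VIP.
        with open(path, "w") as f:
            f.write(KEEPALIVED_CONF)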

As I said, we're rolling this out as we speak. In a few weeks when everything 
is upgraded I can share what we learned and how well it works. CloudStack could 
use a similar approach.
 
Kind Regards,
Remi



On 05/02/2018, 16:44, "Daan Hoogland"  wrote:

Hi devs,

I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi [2],
that reduce downtime for redundant routers and redundant VPCs respectively.
(please review those)
Now from customers we hear that they also want to reduce downtime for
regular VRs so as we discussed this we came to two possible solutions that
we want to implement one of:

1. start and configure a new router before destroying the old one and then
as a last minute action stop the old one.
2. make all routers start up redundancy services but for regular routers
start only one until an upgrade is required at which time a new, second
router can be started before killing the old one.​

​obviously both solutions have their merits, so I want to have your input
to make the broadest supported implementation.
-1 means there will be an overlap or a small delay and interruption of
service.
+1 It can be argued, "they got what they paid for".
-2 means an overhead in memory usage by the router by the extra services
running on it.
+2 the number of router-varieties will be further reduced.

-1&-2 We have to deal with potentially large upgrade steps from way before
the cloudstack era even and might be stuck to 1 because of that, needing to
hack around it. Any dealing with older VRs, pre 4.5 and especially pre 4.0
will be hard.

I am not cross posting though this might be one of these occasions where it
is appropriate to include users@. Just my puristic inhibitions.

Of course I have preferences but can you share your thoughts, please?
​
​And don't forget to review Wei's [1] and Remi's [2] work please.

​[1] https://github.com/apache/cloudstack/pull/2435​
[2] https://github.com/apache/cloudstack/pull/2436

-- 
Daan




Re: [DISCUSS] VR upgrade downtime reduction

2018-02-06 Thread Wido den Hollander



On 02/06/2018 12:28 PM, Daan Hoogland wrote:

I'm afraid I don't agree on some of your comments, Wido.

On Tue, Feb 6, 2018 at 12:03 PM, Wido den Hollander  wrote:




On 02/05/2018 04:44 PM, Daan Hoogland wrote:


Hi devs,

I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi
[2],
that reduce downtime for redundant routers and redundant VPCs
respectively.
(please review those)
Now from customers we hear that they also want to reduce downtime for
regular VRs so as we discussed this we came to two possible solutions that
we want to implement one of:

1. start and configure a new router before destroying the old one and then
as a last minute action stop the old one.



Seems like a simple solution to me, this wouldn't require a lot of changes
in the VR.


except add in a stop moment just before activating, which doesn't exist yet.
​


Ah, yes. But it would mean additional tests and parameters. Not that 
it's impossible though.


The VR is already fragile imho and could use a lot more love. Adding 
more features might break things which we currently have. That's my fear 
of working on them.







2. make all routers start up redundancy services but for regular routers

start only one until an upgrade is required at which time a new, second
router can be started before killing the old one.​



True, but that would be a problem as you would need to script a lot in the
VR.


​all the scripts for rvr are already on the systemvm
​


Ah, yes, for the VPC, I forgot that.









​obviously both solutions have their merits, so I want to have your input
to make the broadest supported implementation.
-1 means there will be an overlap or a small delay and interruption of
service.
+1 It can be argued, "they got what they paid for".
-2 means an overhead in memory usage by the router by the extra services
running on it.
+2 the number of router-varieties will be further reduced.

-1&-2 We have to deal with potentially large upgrade steps from way before
the cloudstack era even and might be stuck to 1 because of that, needing
to
hack around it. Any dealing with older VRs, pre 4.5 and especially pre 4.0
will be hard.



I don't like hacking. The VRs already are 'hacky' imho.


​yes, it is.​




We (PCextreme) are only using Basic Networking so for us the VR only does
DHCP and Cloud-init, so we don't care about this that much ;)


​thanks for the input anyway, Wido


I think however that it's a valid point. The Redundant Virtual Router is 
mostly important when you have traffic flowing through it.


So for Basic Networking it's less important or for a setup where traffic 
isn't going through the VR and it only does DHCP, am I correct?


Wido


​




Wido


I am not cross posting though this might be one of these occasions where it

is appropriate to include users@. Just my puristic inhibitions.

Of course I have preferences but can you share your thoughts, please?
​
​And don't forget to review Wei's [1] and Remi's [2] work please.

​[1] https://github.com/apache/cloudstack/pull/2435​
[2] https://github.com/apache/cloudstack/pull/2436







Re: [DISCUSS] VR upgrade downtime reduction

2018-02-06 Thread Daan Hoogland
I'm afraid I don't agree on some of your comments, Wido.

On Tue, Feb 6, 2018 at 12:03 PM, Wido den Hollander  wrote:

>
>
> On 02/05/2018 04:44 PM, Daan Hoogland wrote:
>
>> Hi devs,
>>
>> I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi
>> [2],
>> that reduce downtime for redundant routers and redundant VPCs
>> respectively.
>> (please review those)
>> Now from customers we hear that they also want to reduce downtime for
>> regular VRs so as we discussed this we came to two possible solutions that
>> we want to implement one of:
>>
>> 1. start and configure a new router before destroying the old one and then
>> as a last minute action stop the old one.
>>
>
> Seems like a simple solution to me, this wouldn't require a lot of changes
> in the VR.
>
except add in a stop moment just before activating, which doesn't exist yet.
​


>
> 2. make all routers start up redundancy services but for regular routers
>> start only one until an upgrade is required at which time a new, second
>> router can be started before killing the old one.​
>>
>
> True, but that would be a problem as you would need to script a lot in the
> VR.

​all the scripts for rvr are already on the systemvm
​


>
>
>
>> ​obviously both solutions have their merits, so I want to have your input
>> to make the broadest supported implementation.
>> -1 means there will be an overlap or a small delay and interruption of
>> service.
>> +1 It can be argued, "they got what they paid for".
>> -2 means an overhead in memory usage by the router by the extra services
>> running on it.
>> +2 the number of router-varieties will be further reduced.
>>
>> -1&-2 We have to deal with potentially large upgrade steps from way before
>> the cloudstack era even and might be stuck to 1 because of that, needing
>> to
>> hack around it. Any dealing with older VRs, pre 4.5 and especially pre 4.0
>> will be hard.
>>
>>
> I don't like hacking. The VRs already are 'hacky' imho.
>
​yes, it is.​


>
> We (PCextreme) are only using Basic Networking so for us the VR only does
> DHCP and Cloud-init, so we don't care about this that much ;)
>
​thanks for the input anyway, Wido
​


>
> Wido
>
>
> I am not cross posting though this might be one of these occasions where it
>> is appropriate to include users@. Just my puristic inhibitions.
>>
>> Of course I have preferences but can you share your thoughts, please?
>> ​
>> ​And don't forget to review Wei's [1] and Remi's [2] work please.
>>
>> ​[1] https://github.com/apache/cloudstack/pull/2435​
>> [2] https://github.com/apache/cloudstack/pull/2436
>>
>>


-- 
Daan


Re: [DISCUSS] VR upgrade downtime reduction

2018-02-06 Thread Wido den Hollander



On 02/05/2018 04:44 PM, Daan Hoogland wrote:

Hi devs,

I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi [2],
that reduce downtime for redundant routers and redundant VPCs respectively.
(please review those)
Now from customers we hear that they also want to reduce downtime for
regular VRs so as we discussed this we came to two possible solutions that
we want to implement one of:

1. start and configure a new router before destroying the old one and then
as a last minute action stop the old one.


Seems like a simple solution to me, this wouldn't require a lot of 
changes in the VR.



2. make all routers start up redundancy services but for regular routers
start only one until an upgrade is required at which time a new, second
router can be started before killing the old one.​


True, but that would be a problem as you would need to script a lot in 
the VR.




​obviously both solutions have their merits, so I want to have your input
to make the broadest supported implementation.
-1 means there will be an overlap or a small delay and interruption of
service.
+1 It can be argued, "they got what they paid for".
-2 means an overhead in memory usage by the router by the extra services
running on it.
+2 the number of router-varieties will be further reduced.

-1&-2 We have to deal with potentially large upgrade steps from way before
the cloudstack era even and might be stuck to 1 because of that, needing to
hack around it. Any dealing with older VRs, pre 4.5 and especially pre 4.0
will be hard.



I don't like hacking. The VRs already are 'hacky' imho.

We (PCextreme) are only using Basic Networking so for us the VR only 
does DHCP and Cloud-init, so we don't care about this that much ;)


Wido


I am not cross posting though this might be one of these occasions where it
is appropriate to include users@. Just my puristic inhibitions.

Of course I have preferences but can you share your thoughts, please?
​
​And don't forget to review Wei's [1] and Remi's [2] work please.

​[1] https://github.com/apache/cloudstack/pull/2435​
[2] https://github.com/apache/cloudstack/pull/2436



[DISCUSS] VR upgrade downtime reduction

2018-02-05 Thread Daan Hoogland
Hi devs,

I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi [2],
that reduce downtime for redundant routers and redundant VPCs respectively.
(please review those)
Now from customers we hear that they also want to reduce downtime for
regular VRs so as we discussed this we came to two possible solutions that
we want to implement one of:

1. start and configure a new router before destroying the old one and then
as a last minute action stop the old one.
2. make all routers start up redundancy services but for regular routers
start only one until an upgrade is required at which time a new, second
router can be started before killing the old one.​

​obviously both solutions have their merits, so I want to have your input
to make the broadest supported implementation.
-1 means there will be an overlap or a small delay and interruption of
service.
+1 It can be argued, "they got what they paid for".
-2 means an overhead in memory usage by the router by the extra services
running on it.
+2 the number of router-varieties will be further reduced.

-1&-2 We have to deal with potentially large upgrade steps from way before
the cloudstack era even and might be stuck to 1 because of that, needing to
hack around it. Any dealing with older VRs, pre 4.5 and especially pre 4.0
will be hard.

I am not cross posting though this might be one of these occasions where it
is appropriate to include users@. Just my puristic inhibitions.

Of course I have preferences but can you share your thoughts, please?
​
​And don't forget to review Wei's [1] and Remi's [2] work please.

​[1] https://github.com/apache/cloudstack/pull/2435​
[2] https://github.com/apache/cloudstack/pull/2436

-- 
Daan