Re: HA redundant virtual router

Sheng Yang Thu, 05 Sep 2013 15:29:30 -0700

Here is the doc.

https://cwiki.apache.org/confluence/display/CLOUDSTACK/Redundant+Virtual+Router+Functional+Spec


It's not extremely detailed, but it describes today's design in general terms.

--Sheng


On Thu, Aug 29, 2013 at 8:17 AM, Daan Hoogland <[email protected]> wrote:

> ok,
>
> let's postpone the discussion till you are at least half done. We
> will of course continue to deliberate on what we need internally.
>
> Daan
>
> On Thu, Aug 29, 2013 at 5:08 PM, Sheng Yang <[email protected]> wrote:
> > Hi Daan,
> >
> > As I said, I am writing a design doc to describe the current redundant
> > router policy, to help people understand the redundant router. Currently it
> > doesn't support VPC, so how to implement it in VPC is still open to discussion.
> >
> > --Sheng
> >
> >
> > On Thu, Aug 29, 2013 at 4:26 AM, Daan Hoogland <[email protected]>
> > wrote:
> >>
> >> Sheng,
> >>
> >> just to make sure: you are going to write this document? I see Roeland
> >> understood your mail like this.
> >>
> >> When you do, I'd like you to keep in mind that we also want redundant
> >> routers within a VPC to ensure ACS upgrades are more seamless for
> >> customer application groups and - dtap streets. If you need any help
> >> on writing such a doc, let me know.
> >>
> >> kind regards,
> >> Daan
> >>
> >> On Thu, Aug 29, 2013 at 1:13 PM, Roeland Kuipers
> >> <[email protected]> wrote:
> >> > Hi Sheng,
> >> >
> >> > Thanks for the info. Looking forward to the design doc, I trust this
> >> > will make things clearer.
> >> > In the meantime we will be doing some research and thinking too, to see
> >> > how we can improve things to also have HA on the RvR in a safe way.
> >> > We will share this once ready.
> >> >
> >> > Thanks,
> >> > Roeland
> >> >
> >> >
> >> > From: Sheng Yang [mailto:[email protected]]
> >> > Sent: Thursday, 29 August 2013 0:19
> >> > To: <[email protected]>
> >> > Cc: int-cloud; Daan Hoogland
> >> > Subject: Re: HA redundant virtual router
> >> >
> >> > Hi Roeland,
> >> >
> >> > I will write a design doc to explain how the redundant router works
> >> > currently. For example, for point 2, we have to force BACKUP to become
> >> > MASTER because:
> >> >
> >> > 1. CS cannot communicate with MASTER at the time.
> >> > 2. CS can communicate with BACKUP.
> >> > 3. The rule has to be programmed immediately.
> >> > 4. In case the old MASTER comes back, it should yield to the VR with the
> >> > updated rule, rather than preempt the updated VR.
> >> >
> >> > In this case, CS needs to communicate with the RvR to program the new rule,
> >> > so it needs to intervene in the RvR to ensure that if only one VR got the
> >> > rule, that VR becomes MASTER.
> >> >
> >> > Still, I will write a doc later to try to cover every concern about the RvR
> >> > design.
> >> >
> >> > --Sheng
> >> >
> >> > On Tue, Aug 27, 2013 at 3:40 AM, Roeland Kuipers
> >> > <[email protected]> wrote:
> >> > Hi Sheng,
> >> >
> >> > Thanks for your reply. I'll see if we can replay this scenario.
> >> >
> >> > With respect to point 1: a good principle IMHO.
> >> >
> >> > Point 2: Why do we force a keepalived node to become master instead of
> >> > waiting for keepalived to elect a master itself? That way there is less
> >> > reason to intervene and less risk of multiple masters, a behavior we have
> >> > seen with RvR without HA in the past. The downside is that rule updates do
> >> > not take effect until the backup becomes master. But maybe this is wise
> >> > anyway, since something is wrong. This conflicts a bit with point 1, as we
> >> > do intervene here.
> >> >
> >> > Point 3: In my opinion keepalived is solid enough to leave this
> >> > responsibility with keepalived, and CS should just check the state and
> >> > not fiddle with priorities to force masters, because there is obviously a
> >> > reason why BACKUP refuses to become master.
> >> > I think we should let keepalived prevent multiple masters, as it is
> >> > designed to do. Or am I missing something here?
> >> > Actually, in the scenario you described, with a functioning guest
> >> > network, keepalived should be able to handle this situation if we make
> >> > sure all routers have different priorities.
> >> >
> >> > I still have the opinion HA and RvR are different mechanisms.
> >> >
> >> > So what do you think is necessary to make HA in combination with RvR
> >> > possible? We have a clear business requirement to have this implemented
> >> > in CS, and we have developers willing to create the changes to make this
> >> > possible. We would also like to see RvR on VPCs and are willing to
> >> > contribute this functionality.
> >> >
> >> > Thanks for your feedback!
> >> >
> >> > Cheers,
> >> > Roeland
> >> >
> >> > -----Original Message-----
> >> > From: Sheng Yang [mailto:[email protected]]
> >> > Sent: Friday, 23 August 2013 23:25
> >> > To: <[email protected]>
> >> > Subject: Re: HA redundant virtual router
> >> >
> >> > Hi Roeland,
> >> >
> >> > Thank you for your testing!
> >> >
> >> > Power off is not a concern right now, because in that case the VM would
> >> > disappear anyway.
> >> >
> >> > Our concern is more about the case where the VM is still alive but we
> >> > cannot detect it for a while. For example, a network glitch happens and CS
> >> > temporarily loses its connection to the host (control network), but the
> >> > guest network is still working.
> >> > HA would start another VR, which could result in 3 routers in
> >> > the guest network (at least for a moment). Much of the policy focuses on
> >> > dealing with these intermediate states. Also, if you unplug the network
> >> > cable of one host, many things would happen...
> >> >
> >> >
> >> > In RvR we want to make sure:
> >> > 1. The states are self-governed, with no need for CS to intervene.
> >> > 2. MASTER always gets the latest rules. That means that if we cannot
> >> > communicate with MASTER, we turn to BACKUP, program the rule on it,
> >> > and make it MASTER, even though we cannot communicate with MASTER at that
> >> > time. BACKUP should be able to become MASTER if we request it. This is
> >> > achieved by using a script to bump up the priority of BACKUP.
> >> > 3. We try our best to prevent the dual-MASTER situation. So we program a
> >> > different priority for each VR, and the MASTER/BACKUP state depends
> >> > completely on priority.
> >> >
> >> > And if you take RvR as an alternative to the VM HA mechanism, it's not
> >> > that counterintuitive in fact.
> >> >
> >> > --Sheng
> >> >
> >> >
> >> > On Fri, Aug 23, 2013 at 1:56 AM, Roeland Kuipers
> >> > <[email protected]> wrote:
> >> >
> >> >> Hi Sheng,
> >> >>
> >> >> So far our testing showed no big problems. I've marked a redundant set
> >> >> of routers as HA-enabled by setting the ha_enabled bit in the
> >> >> vm_instance table. (This is our workaround ATM.) We tested HA in
> >> >> combination with RvR in two scenarios: shutdown and forced power-off of
> >> >> the VM. In these scenarios HA worked a treat and restored the redundant
> >> >> pair as it should, and keepalived nicely negotiated MASTER & BACKUP.
> >> >> These are obviously basic tests, but we are happy to do some more
> >> >> testing.
> >> >>
> >> >> I understand your concerns and am totally in favour of the KISS
> >> >> principle.
> >> >> What could be the scenario to end up with 3 routers?
> >> >> Why is the situation complex to deal with? These are separate
> >> >> mechanisms.
> >> >> HA just makes sure the router is up and alive, and keepalived
> >> >> negotiates the MASTER/BACKUP states according to the keepalived
> >> >> configuration, unless there are 3 routers with conflicting configs. But
> >> >> so far I do not understand the scenario where we could end up with 3
> >> >> routers, so I cannot judge and/or test this.
> >> >>
> >> >> We would like to see the hardcoded denial of HA in a redundant router
> >> >> setup go, for several reasons:
> >> >> 1. It's counterintuitive: we configured an HA service offering on
> >> >> purpose for the RvRs, and found out by accident that it was not
> >> >> enabled at all.
> >> >> 2. CS could implement a default offering without HA for this setup (to
> >> >> keep it simple by default and keep the currently forced behaviour), but
> >> >> if users like us deliberately want HA, they can create a
> >> >> custom offering with HA enabled.
> >> >>
> >> >> This way it's configurable, doesn't change default behavior and is
> >> >> more intuitive.
> >> >>
> >> >> Thanks & Cheers,
> >> >> Roeland
> >> >>
> >> >>
> >> >>
> >> >> -----Original Message-----
> >> >> From: Sheng Yang [mailto:[email protected]]
> >> >> Sent: Friday, 23 August 2013 3:03
> >> >> To: <[email protected]>
> >> >> Subject: Re: HA redundant virtual router
> >> >>
> >> >> It's a design choice; the only reason is that it would be a very complex
> >> >> situation to deal with. In fact, the redundant router's own policy
> >> >> is already very complex...
> >> >>
> >> >> We didn't look into the details at the time of implementing the
> >> >> redundant router, but there are lots of concerns, e.g. a network glitch
> >> >> may result in 3 routers running in the network, and potentially two of
> >> >> them are in MASTER state.
> >> >>
> >> >> Of course discussion is welcome. We just wanted to keep it as simple as
> >> >> possible at the time.
> >> >>
> >> >> --Sheng
> >> >>
> >> >>
> >> >> On Thu, Aug 22, 2013 at 3:31 AM, Daan Hoogland
> >> >> <[email protected]> wrote:
> >> >>
> >> >> > LS,
> >> >> >
> >> >> > Schuberg Philis guarantees 100% functional uptime for their
> >> >> > customers.
> >> >> > Infrastructure is of course part of this promise and the easier
> >> >> > factor to provide strong levels of resiliency. For this reason we
> >> >> > want to make use of redundant virtual routers together with HA
> >> >> > functionality.
> >> >> >
> >> >> > We see HA and redundant routers as two different methods to provide
> >> >> > higher levels of uptime.
> >> >> >
> >> >> >
> >> >> > 1.      The redundant router setup takes care of seamless failover
> >> >> > without lengthy hiccups in the case of a single router failure.
> >> >> >
> >> >> > 2.      HA takes care of restarting a failed VM or router, restoring
> >> >> > connectivity in the case of a single router or restoring 2N resiliency
> >> >> > in the case of a redundant router setup.
> >> >> >
> >> >> > The combination of these two methods will help us to meet our 100%
> >> >> > promise. We need to restore 2N redundancy ASAP in the case of a
> >> >> > single component failure, e.g. a router. With these two methods
> >> >> > combined the system is more autonomous and doesn't need human
> >> >> > intervention to restore redundancy.
> >> >> >
> >> >> > In the current situation we need to page an on-call
> >> >> > engineer to restore redundancy ASAP, because of the tight SLAs.
> >> >> > If we could use HA in combination with redundant routers, the on-call
> >> >> > guy could enjoy his sleep and would be a happier guy :) The present code
> >> >> > forces the HA offering to off on redundant routers, which seems odd.
> >> >> >
> >> >> > So my question is: why is it forced off? Is there a technical
> >> >> > constraint, or is this a design choice we can discuss and maybe revise?
> >> >> >
> >> >> > Cheers,
> >> >> >
> >> >> >
> >> >>
> >> >
> >
> >
>
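
The priority policy discussed in this thread (a distinct priority per VR, MASTER/BACKUP decided purely by priority, and CS bumping BACKUP's priority when it can only reach BACKUP) can be sketched roughly as below. This is a simplified model for illustration only, not CloudStack or keepalived code: the `Router` class, `elect_master`, `program_rule`, and the priority values are all hypothetical, and unique priorities are assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Router:
    name: str
    priority: int                  # higher priority wins the MASTER election
    reachable_by_cs: bool = True   # can CS reach it over the control network?
    has_latest_rules: bool = True


def elect_master(routers: List[Router]) -> Router:
    """MASTER/BACKUP depends only on priority (point 3 in the thread)."""
    return max(routers, key=lambda r: r.priority)


def program_rule(routers: List[Router]) -> Router:
    """Push a new rule; return the router that should be MASTER afterwards."""
    reachable = [r for r in routers if r.reachable_by_cs]
    if not reachable:
        raise RuntimeError("CS cannot reach any router; rule not programmed")

    # Only routers that CS can reach receive the new rule.
    for r in routers:
        r.has_latest_rules = r.reachable_by_cs

    master = elect_master(routers)
    if not master.reachable_by_cs:
        # Point 2: bump the best reachable BACKUP above the stale MASTER so
        # the old MASTER yields, rather than preempts, when it comes back.
        promoted = elect_master(reachable)
        promoted.priority = master.priority + 1
    return elect_master(routers)


# r1 is MASTER but unreachable from CS; r2 is a reachable BACKUP.
routers = [
    Router("r1", priority=100, reachable_by_cs=False),
    Router("r2", priority=99),
]
new_master = program_rule(routers)
print(new_master.name, new_master.priority)
```

In this run r2 ends up with priority 101 and wins the election, while r1 is marked as not having the latest rules. The sketch does not model the guest network, so it cannot show the dual-MASTER window that the thread worries about.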
