Hi Sheng,

Thanks for your reply. I'll see if we can replay this scenario.

With respect to point 1: a good principle IMHO.

Point 2: Why do we force a keepalived node to become master instead of waiting 
for keepalived to elect a master itself? That way there is less reason to 
intervene and less risk of multiple masters. We have seen this behavior with 
RvR without HA in the past. The downside is that rule updates do not take 
effect until the backup becomes master. But maybe that is wise anyway, since 
something is wrong. 
This conflicts a bit with point 1, as we do intervene here.

Point 3: In my opinion keepalived is solid enough to leave this responsibility 
with keepalived; CS should just check the state and not fiddle with 
priorities to force a master, because there is obviously a reason why BACKUP 
refuses to become master.
I think we should let keepalived prevent multiple masters, as it is designed 
to do. Or am I missing something here?
Actually, in the scenario you described, with a functioning guest network, 
keepalived should be able to handle the situation as long as we make sure all 
routers have different priorities.
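
To make that last point concrete, here is a minimal sketch of the election as I understand it. It is a simplified model, not keepalived's actual implementation: it assumes all routers can hear each other on the guest network, that the highest priority wins, and it ignores advertisement timers and preemption settings. The names `Router` and `elect_master` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Router:
    name: str
    priority: int  # VRRP priority; higher wins the election


def elect_master(routers):
    """Pick the MASTER among routers that can all hear each other.

    Simplified model: highest priority wins. Equal priorities are rejected
    here, because this sketch has no tie-breaker.
    """
    if not routers:
        raise ValueError("no routers on the guest network")
    priorities = [r.priority for r in routers]
    if len(set(priorities)) != len(priorities):
        raise ValueError("priorities must be distinct")
    return max(routers, key=lambda r: r.priority)


# Three routers on a working guest network, e.g. after HA started an extra one.
routers = [Router("r-1", 100), Router("r-2", 99), Router("r-3", 98)]
master = elect_master(routers)
backups = [r.name for r in routers if r is not master]
```

Under these assumptions a third router does not change the outcome: there is still exactly one winner, and the other two stay BACKUP.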

I am still of the opinion that HA and RvR are different mechanisms.

So what do you think is necessary to make HA in combination with RvR possible? 
We have a clear business requirement to have this implemented in CS, and we 
have developers willing to make the changes needed.
We would also like to see RvR on VPCs and are willing to contribute that 
functionality as well.

Thanks for your feedback!

Cheers,
Roeland

-----Original Message-----
From: Sheng Yang [mailto:sh...@yasker.org] 
Sent: Friday, 23 August 2013 23:25
To: <dev@cloudstack.apache.org>
Subject: Re: HA redundant virtual router

Hi Roeland,

Thank you for your testing!

Power off is not a concern right now, because at that time the VM would 
disappear anyway.

Our concern is more about the case where the VM is still alive but we cannot 
detect it for a while. For example, a network glitch happens and CS loses its 
connection to the host temporarily (control network), but the guest network is 
still working.
HA would start another VR, which could result in 3 routers in the guest 
network (at least for a moment). Many of the policies focus on dealing with 
these intermediate states. Also, if you unplug the network cable of one host, 
many things would happen...


In RvR we want to make sure:
1. The status is self-governed; there is no need for CS to intervene.
2. MASTER always gets the latest rules. That means that if we cannot 
communicate with MASTER, we turn to BACKUP, program the rule on it and 
make it MASTER, even though we cannot communicate with MASTER at that time.
BACKUP should be able to become MASTER if we request it. This is achieved by 
using a script to bump up the priority of BACKUP.
3. We try our best to prevent the dual-MASTER situation. So we program a 
different priority for each VR, and the MASTER/BACKUP status depends entirely 
on priority.
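
The policy in point 2 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual CloudStack code: `VirtualRouter`, `program_rule` and the bump of one above the last known MASTER priority are hypothetical, and in reality the bump is done by a script on the router itself.

```python
from dataclasses import dataclass, field

# Highest VRRP priority for a router that does not own the virtual IP.
MAX_PRIORITY = 254


@dataclass
class VirtualRouter:
    name: str
    priority: int
    reachable: bool = True  # can the management server talk to it?
    rules: list = field(default_factory=list)


def program_rule(master, backup, rule):
    """Program a rule on MASTER, or on BACKUP and promote it (hypothetical)."""
    if master.reachable:
        master.rules.append(rule)
        return master
    if not backup.reachable:
        raise RuntimeError("neither router is reachable")
    if master.priority >= MAX_PRIORITY:
        raise RuntimeError("no priority headroom left to promote BACKUP")
    # MASTER cannot be reached: give BACKUP the rule and raise its priority
    # above MASTER's last known value so that it wins the next election.
    backup.rules.append(rule)
    backup.priority = master.priority + 1
    return backup


master = VirtualRouter("r-1", 100, reachable=False)
backup = VirtualRouter("r-2", 99)
target = program_rule(master, backup, "allow tcp/22")
```

In this example the rule lands on `r-2`, whose priority becomes 101, while the unreachable `r-1` keeps its old rule set.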

And if you take RvR as an alternative to the VM HA mechanism, it's not that 
counterintuitive in fact.

--Sheng


On Fri, Aug 23, 2013 at 1:56 AM, Roeland Kuipers < rkuip...@schubergphilis.com> 
wrote:

> Hi Sheng,
>
> So far our testing showed no big problems. I've marked a redundant set 
> of routers as HA-enabled by setting the ha_enabled bit in the 
> vm_instance table. (This is our workaround ATM.) We tested HA icw RvR 
> in two scenarios: shutdown and forced power-off of a VM. In both, HA 
> worked a treat and restored the redundant pair as it should, and 
> keepalived nicely negotiated MASTER & BACKUP.
> These are obviously basic tests, but we are happy to do some more testing.
>
> I understand your concerns and am totally in favour of the KISS principle.
> What could be the scenario to end up with 3 routers?
> Why is the situation complex to deal with? These are separate mechanisms: 
> HA just makes sure the router is up and alive, and keepalived 
> negotiates the MASTER/BACKUP states according to the keepalived 
> configuration, unless there are 3 routers with conflicting configs. But 
> so far I do not understand the scenario where we could end up with 3 
> routers, so I cannot judge and/or test this.
>
> We would like to see the hardcoded denial of HA in a redundant router 
> setup go, for several reasons:
> 1. It's counterintuitive: we configured an HA service offering on 
> purpose for the RvRs, and found out by accident that it was not 
> enabled at all.
> 2. CS could implement a default offering without HA for this setup (to 
> keep it simple by default and keep the currently forced behaviour), but if 
> users, like us, deliberately want HA, they can create a 
> custom offering with HA enabled.
>
> This way it's configurable, doesn't change default behavior and is 
> more intuitive.
>
> Thanks & Cheers,
> Roeland
>
>
>
> -----Original Message-----
> From: Sheng Yang [mailto:sh...@yasker.org]
> Sent: Friday, 23 August 2013 3:03
> To: <dev@cloudstack.apache.org>
> Subject: Re: HA redundant virtual router
>
> It's a design choice; the only reason is that it would be a very complex 
> situation to deal with. In fact, the redundant router's own policy 
> is already very complex...
>
> We didn't look into details at the time of implementing redundant 
> router, but there are lots of concerns e.g. a network glitch may 
> result in 3 routers running in the network and potentially two of them 
> are in MASTER state.
>
> Of course discussion is welcome. We just want to keep it as simple as 
> possible at the time.
>
> --Sheng
>
>
> On Thu, Aug 22, 2013 at 3:31 AM, Daan Hoogland < 
> dhoogl...@schubergphilis.com
> > wrote:
>
> > LS,
> >
> > Schuberg Philis guarantees 100% functional uptime for their customers.
> > Infrastructure is of course part of this promise and the easier 
> > factor to provide strong levels of resiliency. For this reason we 
> > want to make use of redundant virtual routers together with HA 
> > functionality.
> >
> > We see HA and redundant routers as two different methods to provide 
> > higher levels of uptime.
> >
> >
> > 1.      The redundant router setup takes care of seamless failover without
> > lengthy hiccups in the case of a single router failure.
> >
> > 2.      HA takes care of restarting a failed VM or router, restoring
> > connectivity in the case of a single router or restoring 2N resiliency 
> > in the case of a redundant router setup.
> >
> > The combination of these two methods will help us to meet our 100% 
> > promise. We need to restore 2N redundancy ASAP in the case of 
> > single component failure e.g. a router. With these two methods 
> > combined the system is more autonomous and doesn't need human 
> > intervention to restore redundancy.
> >
> > In the current situation we need to page an on-call 
> > engineer to restore redundancy ASAP, because of the tight SLAs. 
> > If we could use HA icw redundant routers, the on-call guy could 
> > enjoy his sleep and would be a happier guy :) The present code 
> > forces the HA offering to off on redundant routers, which seems odd.
> >
> > So my question is: why is it forced to off? Is there a technical 
> > constraint, or is this a design choice we can discuss and maybe revise?
> >
> > Cheers,
> >
> >
>
