Sheng,

Just to make sure: you are going to write this document? I see Roeland
understood your mail that way.

When you do, I'd like you to keep in mind that we also want redundant
routers within a VPC, to make ACS upgrades more seamless for customer
application groups and DTAP streets. If you need any help with writing
such a doc, let me know.

kind regards,
Daan

On Thu, Aug 29, 2013 at 1:13 PM, Roeland Kuipers
<rkuip...@schubergphilis.com> wrote:
> Hi Sheng,
>
> Thanks for the info. Looking forward to the design doc; I trust this will
> make things clearer.
> In the meantime we will be doing some research and thinking too, to see how
> we can improve things so that we can also have HA on the RvR in a safe way.
> We will share this once it is ready.
>
> Thanks,
> Roeland
>
>
> From: Sheng Yang [mailto:sh...@yasker.org]
> Sent: Thursday 29 August 2013 0:19
> To: <dev@cloudstack.apache.org>
> Cc: int-cloud; Daan Hoogland
> Subject: Re: HA redundant virtual router
>
> Hi Roeland,
>
> I will write a design doc to explain how the redundant router currently
> works. For example, regarding point 2, we have to force BACKUP to become
> MASTER because:
>
> 1. CS cannot communicate with MASTER at that time.
> 2. CS can communicate with BACKUP.
> 3. The rule has to be programmed immediately.
> 4. If the old MASTER comes back, it should yield to the VR with the updated
> rule, rather than preempt the updated VR.
>
> In this case, CS needs to communicate with the RvR to program the new rule,
> so it needs to intervene in the RvR to ensure that, if only one VR has
> received the rule, that VR becomes MASTER.
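
A minimal sketch of how the four points above could translate into a decision
rule. It is only an illustration, not the actual CloudStack code: the names
`RulePlan` and `plan_rule_update` are hypothetical, and the behaviour when both
routers are reachable, or when neither is, is an assumption that the mail does
not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RulePlan:
    program_on: List[str]    # roles of the routers that receive the new rule
    promote: Optional[str]   # role whose priority gets bumped, if any


def plan_rule_update(master_reachable: bool, backup_reachable: bool) -> RulePlan:
    """Decide where to program a new rule and whether to force a promotion."""
    if master_reachable and backup_reachable:
        # Assumption: with both routers reachable, both get the rule and
        # the MASTER/BACKUP roles are left alone.
        return RulePlan(["MASTER", "BACKUP"], None)
    if master_reachable:
        # Assumption: MASTER already holds the latest rule, nothing to force.
        return RulePlan(["MASTER"], None)
    if backup_reachable:
        # The case from the mail: only BACKUP can be reached, the rule must
        # be programmed now, so BACKUP gets the rule and is promoted.
        return RulePlan(["BACKUP"], "BACKUP")
    # Assumption: with no router reachable the rule cannot be programmed.
    raise RuntimeError("no router reachable; rule cannot be programmed")
```

Calling `plan_rule_update(False, True)` yields a plan that programs the rule on
BACKUP and promotes it, which is the intervention under discussion.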
>
> I will still write a doc later to try to cover every concern about the RvR
> design.
>
> --Sheng
>
> On Tue, Aug 27, 2013 at 3:40 AM, Roeland Kuipers
> <rkuip...@schubergphilis.com> wrote:
> Hi Sheng,
>
> Thanks for your reply. I'll see if we can replay this scenario.
>
> With respect to point 1: a good principle, IMHO.
>
> Point 2: Why do we force a keepalived node to become master rather than wait
> for keepalived to promote it? That way there is less reason to intervene and
> less risk of multiple masters, a behavior we have seen with RvR without HA in
> the past. The downside is that updates to rules do not take effect until the
> backup becomes master. But maybe this is wise anyway, since something is
> wrong. This conflicts a bit with point 2 as we do intervene here.
>
> Point 3: In my opinion keepalived is solid enough to leave this
> responsibility with keepalived; CS should just check the state and not
> fiddle with priorities to force masters, because there is obviously a
> reason why BACKUP refuses to become master.
> I think we should let keepalived prevent multiple masters, as it is designed
> to do. Or am I missing something here?
> Actually, in the scenario you described, with a functioning guest network,
> keepalived should be able to handle this situation if we make sure all
> routers have different priorities.
>
> I am still of the opinion that HA and RvR are different mechanisms.
>
> So what do you think is necessary to make HA in combination with RvR
> possible? We have a clear business requirement to have this implemented in
> CS, and we have developers willing to make the changes needed.
> We would also like to see RvR on VPCs and are willing to contribute this
> functionality as well.
>
> Thanks for your feedback!
>
> Cheers,
> Roeland
>
> -----Original Message-----
> From: Sheng Yang [mailto:sh...@yasker.org]
> Sent: Friday 23 August 2013 23:25
> To: <dev@cloudstack.apache.org>
> Subject: Re: HA redundant virtual router
>
> Hi Roeland,
>
> Thank you for your testing!
>
> Power off is not a concern right now, because in that case the VM would
> disappear anyway.
>
> Our concern is more about the case where the VM is still alive but we cannot
> detect it for a while. For example, a network glitch happens and CS
> temporarily loses its connection to the host (control network), but the
> guest network is still working.
> HA would start another VR, which could result in 3 routers in the guest
> network (at least for a moment). Many of the policies focus on dealing with
> these intermediate states. Also, if you unplug the network cable of one
> host, many things would happen...
>
>
> In RvR we want to make sure:
> 1. The state is self-governed; there is no need for CS to intervene.
> 2. MASTER always gets the latest rules. That means, if we cannot
> communicate with MASTER, we turn to BACKUP, program the rule on it and
> make it MASTER - even if we cannot communicate with MASTER at that time.
> And BACKUP should be able to become MASTER if we request it. This is
> achieved by using a script to bump up the priority of BACKUP.
> 3. We try our best to prevent the dual-MASTER situation. So we program a
> different priority for each VR, and the MASTER/BACKUP state depends
> completely on priority.
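
The priority mechanism in points 2 and 3 can be illustrated with a toy model.
This is a sketch under stated assumptions, not keepalived's actual VRRP
implementation: `elect_master` and `bump_priority` are hypothetical helpers,
and the router names and priority values are made up for the example.

```python
from typing import Dict


def elect_master(priorities: Dict[str, int]) -> str:
    """Return the router with the highest priority (toy election)."""
    if len(set(priorities.values())) != len(priorities):
        # Point 3: every VR must have a different priority, otherwise
        # the outcome of the election is ambiguous.
        raise ValueError("priorities must be distinct")
    return max(priorities, key=priorities.get)


def bump_priority(priorities: Dict[str, int], router: str, step: int = 1) -> Dict[str, int]:
    """Return a copy in which `router` outranks every other router."""
    bumped = dict(priorities)
    bumped[router] = max(priorities.values()) + step
    return bumped


# Hypothetical pair of routers: r-1 starts as MASTER.
priorities = {"r-1": 100, "r-2": 99}
assert elect_master(priorities) == "r-1"

# Point 2: bumping the BACKUP's priority makes it win the next election.
after_bump = bump_priority(priorities, "r-2")
assert elect_master(after_bump) == "r-2"
```

In this model the old MASTER keeps its lower priority when it returns, so it
yields to the bumped router instead of preempting it.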
>
> And if you take RvR as an alternative to the VM's HA mechanism, it's not
> that counterintuitive in fact.
>
> --Sheng
>
>
> On Fri, Aug 23, 2013 at 1:56 AM, Roeland Kuipers
> <rkuip...@schubergphilis.com> wrote:
>
>> Hi Sheng,
>>
>> So far our testing has shown no big problems. I've marked a redundant set
>> of routers as HA-enabled by setting the ha_enabled bit in the
>> vm_instance table. (This is our workaround ATM.) We tested HA in
>> combination with RvR in two scenarios: shutdown and forced power off of
>> the VM. In these scenarios HA worked a treat and restored the redundant
>> pair as it should, and keepalived nicely negotiated MASTER & BACKUP.
>> These are obviously basic tests, but we are happy to do some more testing.
>>
>> I understand your concerns and am totally in favour of the KISS principle.
>> What could be the scenario to end up with 3 routers?
>> Why is the situation complex to deal with? These are separate mechanisms.
>> HA just makes sure the router is up and alive, and keepalived
>> negotiates the MASTER-BACKUP states according to the keepalived
>> configuration, unless there are 3 routers with conflicting configs. But
>> so far I do not understand the scenario where we could end up with 3
>> routers, so I cannot judge and/or test this.
>>
>> We would like to see the hardcoded denial of HA in a redundant router
>> setup go, for several reasons:
>> 1. It's counterintuitive - we configured an HA service offering on
>> purpose for the RvRs, and found out by accident that it was not
>> enabled at all.
>> 2. CS could implement a default offering without HA for this setup (to
>> keep it simple by default and keep the currently forced behaviour), but
>> users who, like us, deliberately want HA can create a custom offering
>> with HA enabled.
>>
>> This way it's configurable, doesn't change the default behavior and is
>> more intuitive.
>>
>> Thanks & Cheers,
>> Roeland
>>
>>
>>
>> -----Original Message-----
>> From: Sheng Yang [mailto:sh...@yasker.org]
>> Sent: Friday 23 August 2013 3:03
>> To: <dev@cloudstack.apache.org>
>> Subject: Re: HA redundant virtual router
>>
>> It's a design choice; the only reason is that it would be a very complex
>> situation to deal with. In fact, the redundant router's policy itself
>> is already very complex...
>>
>> We didn't look into the details at the time of implementing the redundant
>> router, but there are lots of concerns, e.g. a network glitch may
>> result in 3 routers running in the network, with potentially two of them
>> in MASTER state.
>>
>> Of course discussion is welcome. We just want to keep it as simple as
>> possible at the time.
>>
>> --Sheng
>>
>>
>> On Thu, Aug 22, 2013 at 3:31 AM, Daan Hoogland
>> <dhoogl...@schubergphilis.com> wrote:
>>
>> > LS,
>> >
>> > Schuberg Philis guarantees 100% functional uptime for their customers.
>> > Infrastructure is of course part of this promise, and the part for
>> > which strong resiliency is easier to provide. For this reason we
>> > want to make use of redundant virtual routers together with HA 
>> > functionality.
>> >
>> > We see HA and redundant routers as two different methods to provide
>> > higher levels of uptime.
>> >
>> >
>> > 1.      The redundant router setup takes care of seamless failover
>> > without lengthy hiccups in the case of a single router failure.
>> >
>> > 2.      HA takes care of restarting a failed VM or router, restoring
>> > connectivity in the case of a single router or restoring 2N resiliency
>> > in the case of a redundant router setup.
>> >
>> > The combination of these two methods will help us to meet our 100%
>> > promise. We need to restore 2N redundancy ASAP in the case of a
>> > single component failure, e.g. a router. With these two methods
>> > combined the system is more autonomous and doesn't need human
>> > intervention to restore redundancy.
>> >
>> > In the current situation we need to send a page to an on-call
>> > engineer to restore redundancy ASAP, because of the tight SLAs.
>> > If we could use HA in combination with redundant routers, the on-call
>> > guy could enjoy his sleep and would be a happier guy :) The present
>> > code forces the HA offering to off on redundant routers, which seems
>> > odd.
>> >
>> > So my question is: why is it forced to off? Is there a technical
>> > constraint, or is this a design choice we can discuss and maybe revise?
>> >
>> > Cheers,
>> >
>> >
>>
>
