Sheng, just to make sure: you are going to write this document? I see Roeland understood your mail that way.
When you do, I'd like you to keep in mind that we also want redundant
routers within a VPC, to make ACS upgrades more seamless for customer
application groups and DTAP streets. If you need any help writing such a
doc, let me know.

Kind regards,
Daan

On Thu, Aug 29, 2013 at 1:13 PM, Roeland Kuipers <rkuip...@schubergphilis.com> wrote:
> Hi Sheng,
>
> Thanks for the info. Looking forward to the design doc, I trust this will
> make things clearer.
> In the meantime we will be doing some research and thinking too, to see
> how we can improve things to also have HA on the RvR in a safe way.
> We will share this once ready.
>
> Thanks,
> Roeland
>
>
> From: Sheng Yang [mailto:sh...@yasker.org]
> Sent: Thursday, 29 August 2013 0:19
> To: <dev@cloudstack.apache.org>
> Cc: int-cloud; Daan Hoogland
> Subject: Re: HA redundant virtual router
>
> Hi Roeland,
>
> I would write a design doc to explain how the redundant router currently
> works. For example, for point 2, we have to force BACKUP to become MASTER
> because:
>
> 1. CS cannot communicate with MASTER at the time.
> 2. CS can communicate with BACKUP.
> 3. The rule has to be programmed immediately.
> 4. In case the old MASTER comes back, it should yield to the VR with the
>    updated rule, rather than preempt the updated VR.
>
> In this case, CS needs to communicate with the RvR to program the new
> rule, thus it needs to intervene in the RvR to ensure that if only one VR
> got the rule, it becomes MASTER.
>
> Still, I would write a doc later to try to cover every concern of the RvR
> design.
>
> --Sheng
>
> On Tue, Aug 27, 2013 at 3:40 AM, Roeland Kuipers
> <rkuip...@schubergphilis.com> wrote:
> Hi Sheng,
>
> Thanks for your reply. I'll see if we can replay this scenario.
>
> With respect to point 1: a good principle IMHO.
>
> Point 2: Why do we force a keepalived node to become master and not wait
> for keepalived to become master?
> This way there is less reason to intervene and less risk of multiple
> masters, as we have seen this behavior with RvR without HA in the past.
> The downside is that updates to rules do not take effect until the backup
> becomes master. But maybe this is wise anyway, since there is something
> wrong. This conflicts a bit with point 2 as we do intervene here.
>
> Point 3: In my opinion keepalived is solid enough to leave this
> responsibility with keepalived, and CS should just check the state and
> not fiddle with priorities to force masters, because there is obviously a
> reason why BACKUP refuses to become master.
> I think we should let keepalived prevent multiple masters, as it is
> designed to do. Or do I miss something here?
> Actually, in the scenario you described, with a functioning guest network,
> keepalived should be able to handle this situation if we make sure all
> routers have different priorities.
>
> I am still of the opinion that HA and RvR are different mechanisms.
>
> So what do you think is necessary to make HA in combination with RvR
> possible? We have a clear business requirement to have this implemented
> in CS, and we have developers willing to create the changes to make this
> possible.
> We would also like to see RvR on VPCs and are willing to contribute this
> functionality.
>
> Thanks for your feedback!
>
> Cheers,
> Roeland
>
> -----Original Message-----
> From: Sheng Yang [mailto:sh...@yasker.org]
> Sent: Friday, 23 August 2013 23:25
> To: <dev@cloudstack.apache.org>
> Subject: Re: HA redundant virtual router
>
> Hi Roeland,
>
> Thank you for your testing!
>
> Power off is not a concern right now, because at that time the VM would
> disappear anyway.
>
> Our concern is more about the case where the VM is still alive but we
> cannot detect it for a while. For example, a network glitch happens and
> CS loses its connection to the host temporarily (control network), but
> the guest network is still working.
> HA would start another VR, which would possibly result in 3 routers in
> the guest network (at least for a moment). Many of the policies focus on
> dealing with these intermediate states. Also, if you unplug the network
> cable of one host, many things should happen...
>
>
> In RvR we want to make sure:
> 1. The states are self-governed, with no need for CS to intervene.
> 2. MASTER always gets the latest rules. That means, if we cannot
>    communicate with MASTER, we turn to BACKUP, program the rule on it,
>    and make it MASTER, even though we cannot communicate with MASTER at
>    this time. And BACKUP should be able to become MASTER if we request
>    it. This is achieved by using a script to bump up the priority of
>    BACKUP.
> 3. We try our best to prevent the dual-MASTER situation. So we program a
>    different priority for each VR, and the MASTER/BACKUP status depends
>    completely on priority.
>
> And if you take RvR as an alternative to the VM HA mechanism, it's not
> that counterintuitive in fact.
>
> --Sheng
>
>
> On Fri, Aug 23, 2013 at 1:56 AM, Roeland Kuipers <rkuip...@schubergphilis.com> wrote:
>
>> Hi Sheng,
>>
>> So far our testing showed no big problems. I've marked a redundant set
>> of routers as HA-enabled by setting the ha_enabled bit in the
>> vm_instance table. (This is our workaround ATM.) We tested HA in
>> combination with RvR in two scenarios: shutdown and forced power off
>> of the VM. In these scenarios HA worked a treat and restored the
>> redundant pair as it should, and keepalived nicely negotiated
>> MASTER & BACKUP.
>> These are obviously basic tests, but we are happy to do some more testing.
>>
>> I understand your concerns and am totally in favour of the KISS principle.
>> What could be the scenario to end up with 3 routers?
>> Why is the situation complex to deal with? These are separate mechanisms.
>> HA just making sure the router is up and alive.
>> And keepalived negotiating MASTER-BACKUP states according to the
>> keepalived configuration, unless there are 3 routers with conflicting
>> configs. But so far I do not understand the scenario where we could end
>> up with 3 routers, so I cannot judge and/or test this.
>>
>> We would like to see the hardcoded denial of HA in a redundant router
>> setup go, for several reasons:
>> 1. It's counterintuitive - we configured an HA service offering on
>>    purpose for the RvRs, and found out by accident that it was not
>>    enabled at all.
>> 2. CS could implement a default offering without HA for this setup (to
>>    keep it simple by default and keep the currently forced behaviour),
>>    but users who, like us, deliberately want HA can create a custom
>>    offering with HA enabled.
>>
>> This way it's configurable, doesn't change default behavior and is
>> more intuitive.
>>
>> Thanks & Cheers,
>> Roeland
>>
>>
>>
>> -----Original Message-----
>> From: Sheng Yang [mailto:sh...@yasker.org]
>> Sent: Friday, 23 August 2013 3:03
>> To: <dev@cloudstack.apache.org>
>> Subject: Re: HA redundant virtual router
>>
>> It's a design choice; the only reason is that it would be a very complex
>> situation to deal with. In fact the redundant router's own policy
>> is already very complex...
>>
>> We didn't look into details at the time of implementing the redundant
>> router, but there are lots of concerns, e.g. a network glitch may
>> result in 3 routers running in the network, and potentially two of them
>> are in MASTER state.
>>
>> Of course discussion is welcome. We just wanted to keep it as simple as
>> possible at the time.
>>
>> --Sheng
>>
>>
>> On Thu, Aug 22, 2013 at 3:31 AM, Daan Hoogland <dhoogl...@schubergphilis.com> wrote:
>>
>> > LS,
>> >
>> > Schuberg Philis guarantees 100% functional uptime for their customers.
>> > Infrastructure is of course part of this promise and the easier
>> > factor to provide strong levels of resiliency. For this reason we
>> > want to make use of redundant virtual routers together with HA
>> > functionality.
>> >
>> > We see HA and redundant routers as two different methods to provide
>> > higher levels of uptime.
>> >
>> >
>> > 1. The redundant router setup takes care of seamless failover
>> >    without lengthy hiccups in the case of a single router failure.
>> >
>> > 2. HA takes care of restarting a failed VM or router, restoring
>> >    connectivity in the case of a single router or restoring 2N
>> >    resiliency in the case of a redundant router setup.
>> >
>> > The combination of these two methods will help us to meet our 100%
>> > promise. We need to restore 2N redundancy ASAP in the case of a
>> > single component failure, e.g. a router. With these two methods
>> > combined the system is more autonomous and doesn't need human
>> > intervention to restore redundancy.
>> >
>> > In the current situation we need to page an on-call engineer to
>> > restore redundancy ASAP, because of the tight SLAs. If we could use
>> > HA in combination with redundant routers, the on-call guy could
>> > enjoy his sleep and be a happier guy :) The present code forces the
>> > HA offering to off on redundant routers, which seems odd.
>> >
>> > So my question is: why is it forced to off? Is there a technical
>> > constraint, or is this a design choice we can discuss and maybe revise?
>> >
>> > Cheers,
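
For reference, the priority policy discussed in this thread (a distinct priority per VR, CS bumping the BACKUP priority when it cannot reach MASTER, and the old MASTER yielding when it comes back) can be sketched as a small simulation. This is only an illustrative model, not CloudStack or keepalived code: the `Router` class, the `elect_master` and `program_rule` helpers, the priority values 100/99 and the bump of 2 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Router:
    """Hypothetical model of one redundant virtual router (VR)."""
    name: str
    priority: int            # VRRP-style priority; higher wins the election
    reachable: bool = True   # can CS reach this VR over the control network?
    alive: bool = True       # is the VR running on the guest network?
    rules_version: int = 0   # version of the rule set programmed on this VR


def elect_master(routers):
    """Return the live router with the highest priority (None if none alive)."""
    live = [r for r in routers if r.alive]
    return max(live, key=lambda r: r.priority) if live else None


def program_rule(routers, version, bump=2):
    """Program a rule following the policy in the thread (assumed simplification).

    Every reachable VR gets the rule. If the current MASTER is not among
    them, the best reachable VR has its priority raised above the MASTER's,
    so the VR holding the latest rules wins the election.
    """
    master = elect_master(routers)
    updated = [r for r in routers if r.alive and r.reachable]
    for r in updated:
        r.rules_version = version
    master_updated = any(r is master for r in updated)
    if master is not None and updated and not master_updated:
        target = max(updated, key=lambda r: r.priority)
        target.priority = master.priority + bump
    return elect_master(routers)


# Scenario: CS loses the control connection to MASTER while a rule changes.
r1 = Router("r1", priority=100)
r2 = Router("r2", priority=99)
routers = [r1, r2]
first_master = elect_master(routers)      # r1 has the higher priority

r1.reachable = False                      # glitch on the control network only
new_master = program_rule(routers, version=1)

r1.reachable = True                       # old MASTER is reachable again
final_master = elect_master(routers)      # r1 now has the lower priority

print(first_master.name, new_master.name, final_master.name)
```

In this model the returning r1 does not preempt r2, because r2 now holds both the newer rules and the higher priority. The sketch deliberately leaves out the three-router case that HA could create, which is the scenario the thread is still debating.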