Comments inline:

> -----Original Message-----
> From: Shu Wang [mailto:[email protected]]
> Sent: Tuesday, September 29, 2015 3:26 AM
> To: Anders Widell; Mathivanan Naickan Palanivelu
> Cc: [email protected]
> Subject: RE: [users] How to correct a split-brain situation
>
> Thanks, Mathi and Anders.
>
> In the latest split brain scenario, I also observed that active controller
> controller1 failed with this error message:
> ER AMF director heart beat timeout, generating core for amfd
> opensaf_reboot was invoked to stop and start OpenSAF on the node.
>
> Standby controller controller2 promoted itself to active controller.
> However, when the old active controller controller1 restarted, it failed to
> find its peer controller2 within 2 seconds, then controller1 set itself to
> active:
> No peer available => Setting Active role for this node
>
> IMMND in both controllers threw errors:
> Sep 3 07:53:36 controller1 osafimmnd[8181]: WA MDS problem-2, giving up
> Sep 3 07:53:36 controller1 osafimmnd[8181]: ER IMMND - Periodic server job failed
> Sep 3 07:53:36 controller1 osafimmnd[8181]: ER Failed, exiting...
>
> Sep 3 07:53:36 controller2 osafimmnd[19218]: ER IMMND forced to restart on order from IMMD, exiting
> Sep 3 07:53:36 controller2 osafimmd[19203]: WA IMMND coordinator at 20b0f apparently crashed => electing new coord
> Sep 3 07:53:36 controller2 osafimmd[19203]: ER Failed to find candidate for new IMMND coordinator
> Sep 3 07:53:36 controller2 osafimmd[19203]: ER Active IMMD has to restart the IMMSv. All IMMNDs will restart
>
> Controller2 rebooted. During the same period, 2 payload nodes were also
> restarting and they failed to start: because controller1 was starting up and
> controller2 was rebooting, there was no functioning controller.
>
> My questions are:
>
> 1. When controller1 restarted, it should find active controller controller2,
> but it did not. Can we use RDE_DISCOVER_PEER_TIMEOUT to give it more time to
> find its peer? What would be the cons to set RDE_DISCOVER_PEER_TIMEOUT
> to a larger number?
>

[Mathi] Yes. The discovery latency is dependent on your network's latency!

[Mathi] In the initial cluster startup case, a larger value would result in longer cluster role determination. In this test case, it would result in the standby taking more time to join the cluster.

> 2. How to avoid getting into the scenario where both controllers were
> rebooting/restarting, and there was no functioning active controller in the
> cluster?
>

[Mathi] Having an external mechanism that takes down the second controller is one option. Alternatively, FM/RDE are reference implementations that can be enhanced/modified by the user. Having said that, there are also some enhancements being planned in the next release.

Cheers,
Mathi.

> Thanks!
>
> Shu Wang
>
>
> -----Original Message-----
> From: Anders Widell [mailto:[email protected]]
> Sent: Thursday, September 24, 2015 6:35 AM
> To: Mathivanan Naickan Palanivelu; Shu Wang
> Cc: [email protected]
> Subject: Re: [users] How to correct a split-brain situation
>
> Also, I must point out the importance of having a redundant network
> connection between the nodes; otherwise it will be a single point of failure.
> Is your network duplicated?
>
> / Anders Widell
>
> On 09/24/2015 12:21 PM, Mathivanan Naickan Palanivelu wrote:
> > Hi,
> >
> > Note that FMS_PROMOTE_ACTIVE_TIMER and the opensaf_reboot script are two
> > platform adaptation attributes in OpenSAF w.r.t. failover and fencing. An
> > OpenSAF user can customize these in their deployments.
> >
> > Upon receiving a connection loss indication with the active controller,
> > the STANDBY controller starts this promote active timer (see
> > FMS_PROMOTE_ACTIVE_TIMER in /etc/opensaf/fmd.conf).
> > This timer acts as a tolerance mechanism to handle or differentiate
> > temporary link-flaps and false-positives in your network.
> > Upon expiry of this timer, the STANDBY invokes the opensaf_reboot script
> > (with the intention to reboot the ACTIVE node) and subsequently
> > promotes itself to ACTIVE.
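As an illustration of the promote-active timer described above, here is a minimal Python sketch of the standby-side logic. It is a toy model, not OpenSAF code: the class `StandbyController`, its method names, and the abstract time values are assumptions made for this example. Only the idea of a tolerance timer (FMS_PROMOTE_ACTIVE_TIMER) and the order "fence the old ACTIVE, then promote" come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StandbyController:
    """Toy model of the standby's failover decision (hypothetical, not OpenSAF code)."""

    promote_timeout: float  # stands in for FMS_PROMOTE_ACTIVE_TIMER (abstract time units)
    role: str = "STANDBY"
    link_lost_at: Optional[float] = None
    actions: List[str] = field(default_factory=list)

    def on_link_down(self, now: float) -> None:
        # Start the tolerance timer on the first connection-loss indication.
        if self.role == "STANDBY" and self.link_lost_at is None:
            self.link_lost_at = now

    def on_link_up(self, now: float) -> None:
        # The link came back before expiry: treat it as a flap and cancel the timer.
        self.link_lost_at = None

    def tick(self, now: float) -> None:
        # Called periodically; on expiry, fence the old ACTIVE first, then promote.
        if self.role != "STANDBY" or self.link_lost_at is None:
            return
        if now - self.link_lost_at >= self.promote_timeout:
            self.actions.append("fence_peer")  # where opensaf_reboot would be invoked
            self.role = "ACTIVE"
            self.actions.append("promote_to_active")
            self.link_lost_at = None
```

In this model, a link flap shorter than `promote_timeout` leaves the node in the STANDBY role with no actions recorded, while an outage lasting at least `promote_timeout` records the fencing step followed by the promotion. A larger timeout tolerates longer flaps at the cost of a slower failover.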
> >
> > The opensaf_reboot script is an integration point for the OpenSAF
> > user. During failover, when this opensaf_reboot script is invoked,
> > the node information (node_id, PLM ee name) of the peer ACTIVE node is
> > passed as input to this script.
> > The user can modify this script to invoke 'commands'
> > that will perform remote reboots of the old ACTIVE node.
> > The 'commands' here could be an IPMI command or any STONITH
> > agent/command.
> >
> > Cheers,
> > Mathi.
> >
> > ----- [email protected] wrote:
> >
> >> When a system gets into a split-brain scenario, both controllers assume
> >> the active role. How does a payload node distinguish which controller it
> >> is associated with? Is there a way to find out which payload nodes
> >> connect to which controller?
> >>
> >> Our cluster needs to provide service 24x7. So restarting the cluster
> >> is not possible when this situation occurs. What is the best way to
> >> correct a split-brain situation? If we stop and restart one of the
> >> controller nodes to allow it to rejoin the other controller, should
> >> we also restart the payload nodes associated with that controller?
> >> Those payload nodes should be stopped before stopping their
> >> associated controller node, correct?
> >>
> >> Shu Wang
> >>
> >> ________________________________
> >> The information transmitted herein is intended only for the person or
> >> entity to which it is addressed and may contain confidential,
> >> proprietary and/or privileged material. Any review, retransmission,
> >> dissemination or other use of, or taking of any action in reliance
> >> upon, this information by persons or entities other than the intended
> >> recipient is prohibited. If you received this in error, please
> >> contact the sender and delete the material from any computer.
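The opensaf_reboot integration point described in Mathi's message (a hook that receives the peer's node information and issues a remote reboot via IPMI or a STONITH agent) could be sketched roughly as follows. This is a hedged illustration only: the `BMC_ADDRESSES` inventory, the function name `build_fence_command`, the user name, and the choice to key the lookup by controller name are all assumptions for this example, and the thread does not specify how opensaf_reboot receives its arguments. The sketch only builds the `ipmitool` command line and does not execute it.

```python
import shlex
from typing import Dict, List

# Hypothetical inventory mapping a controller name to its BMC address.
# These addresses are documentation placeholders, not values from the thread.
BMC_ADDRESSES: Dict[str, str] = {
    "controller1": "192.0.2.11",
    "controller2": "192.0.2.12",
}


def build_fence_command(peer_name: str, user: str = "admin") -> List[str]:
    """Build an ipmitool command that power-resets the peer controller.

    The password is not placed on the command line: ipmitool's -E option
    reads it from the IPMI_PASSWORD environment variable.
    """
    if peer_name not in BMC_ADDRESSES:
        raise ValueError(f"no BMC address known for {peer_name!r}")
    host = BMC_ADDRESSES[peer_name]
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", user, "-E",
        "chassis", "power", "reset",
    ]


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    print(shlex.join(build_fence_command("controller1")))
```

Refusing to fence an unknown peer, rather than guessing an address, is a deliberate choice in this sketch: a fencing hook that resets the wrong machine is worse than one that fails loudly.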
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
