Thanks, Mathi and Anders.

In the latest split-brain scenario, I also observed that the active controller, controller1, failed with this error message:

ER AMF director heart beat timeout, generating core for amfd

opensaf_reboot was then invoked to stop and start OpenSAF on the node.
The standby controller, controller2, promoted itself to active controller. However, when the old active controller, controller1, restarted, it failed to find its peer controller2 within 2 seconds, so controller1 set itself to active:

No peer available => Setting Active role for this node

IMMND on both controllers threw errors:

Sep 3 07:53:36 controller1 osafimmnd[8181]: WA MDS problem-2, giving up
Sep 3 07:53:36 controller1 osafimmnd[8181]: ER IMMND - Periodic server job failed
Sep 3 07:53:36 controller1 osafimmnd[8181]: ER Failed, exiting...
Sep 3 07:53:36 controller2 osafimmnd[19218]: ER IMMND forced to restart on order from IMMD, exiting
Sep 3 07:53:36 controller2 osafimmd[19203]: WA IMMND coordinator at 20b0f apparently crashed => electing new coord
Sep 3 07:53:36 controller2 osafimmd[19203]: ER Failed to find candidate for new IMMND coordinator
Sep 3 07:53:36 controller2 osafimmd[19203]: ER Active IMMD has to restart the IMMSv. All IMMNDs will restart

Controller2 then rebooted. During the same period, 2 payload nodes were also restarting, and they failed to start because controller1 was starting up and controller2 was rebooting, so there was no functioning controller.

My questions are:

1. When controller1 restarted, it should have found the active controller controller2, but it did not. Can we use RDE_DISCOVER_PEER_TIMEOUT to give it more time to find its peer? What would be the downside of setting RDE_DISCOVER_PEER_TIMEOUT to a larger number?
2. How can we avoid getting into a scenario where both controllers are rebooting/restarting and there is no functioning active controller in the cluster?

Thanks!
Shu Wang

-----Original Message-----
From: Anders Widell [mailto:[email protected]]
Sent: Thursday, September 24, 2015 6:35 AM
To: Mathivanan Naickan Palanivelu; Shu Wang
Cc: [email protected]
Subject: Re: [users] How to correct a split-brain situation

Also, I must point out the importance of having a redundant network connection between the nodes; otherwise it will be a single point of failure.
Is your network duplicated?

/ Anders Widell

On 09/24/2015 12:21 PM, Mathivanan Naickan Palanivelu wrote:
> Hi,
>
> Note that FMS_PROMOTE_ACTIVE_TIMER and opensaf_reboot scripts are two
> platform adaptation attributes in OpenSAF w.r.t failover and fencing. An
> OpenSAF user can customize these in their deployments.
>
> Upon receiving connection loss indication with the active controller,
> the STANDBY controller starts this promote active timer (see
> FMS_PROMOTE_ACTIVE_TIMER in /etc/opensaf/fmd.conf).
> This timer acts as a tolerance mechanism to handle or differentiate
> temporary link-flaps and false-positives in your network.
> Upon expiry of this timer, the STANDBY invokes opensaf_reboot script
> (with the intention to reboot the ACTIVE node) and subsequently promotes
> itself to ACTIVE.
>
> The opensaf_reboot script is an integration point for the OpenSAF
> user. So, during failover when this opensaf_reboot script is invoked
> the node information (node_id, PLM ee name) of the peer ACTIVE node is passed
> as input to this script.
> Inside this script, the user can modify so as to invoke 'commands'
> that will perform remote reboots of the old ACTIVE node.
> The 'commands' here could be an IPMI command or any STONITH agent/command.
>
> Cheers,
> Mathi.
>
> ----- [email protected] wrote:
>
>> When a system gets into split-brain scenario, both controllers assume
>> active role. How does a payload node distinguish which controller it
>> is associated to? Is there a way that we find out which payload nodes
>> connect to which controller?
>>
>> Our cluster needs to provide service 24x7. So restarting the cluster
>> is not possible when this situation occurs. What is the best way to
>> correct a split-brain situation? If we stop and restart one of the
>> controller nodes to allow it to rejoin the other controller, should
>> we also restart the payload nodes associated to that controller?
>> Those payload nodes should be stopped before stopping their
>> associated controller node, correct?
>>
>> Shu Wang
>>
>> ________________________________
>> The information transmitted herein is intended only for the person or
>> entity to which it is addressed and may contain confidential,
>> proprietary and/or privileged material. Any review, retransmission,
>> dissemination or other use of, or taking of any action in reliance
>> upon, this information by persons or entities other than the intended
>> recipient is prohibited. If you received this in error, please
>> contact the sender and delete the material from any computer.
>> _______________________________________________
>> Opensaf-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
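
The standby-side failover sequence that Mathi describes in the quoted message (connection loss, promote-active timer, opensaf_reboot, promotion to ACTIVE) can be sketched as a toy model. This is not OpenSAF code: the function `handle_peer_loss`, its parameters, and the `fence_peer` hook are hypothetical names chosen for illustration. The timer is expressed in plain seconds for readability, which may not match the unit used in fmd.conf, and the node id string is only a placeholder.

```python
from typing import Callable, Optional


def handle_peer_loss(
    promote_active_timer: float,
    link_restored_after: Optional[float],
    fence_peer: Callable[[str], None],
    peer_node_id: str,
) -> str:
    """Return the role the standby ends up in after losing contact with the active.

    promote_active_timer: tolerance window (models FMS_PROMOTE_ACTIVE_TIMER;
                          seconds here is an assumption made for the sketch).
    link_restored_after:  time until the link comes back, or None if it never does.
    fence_peer:           hypothetical hook modelling the opensaf_reboot
                          integration point (e.g. an IPMI or STONITH command).
    peer_node_id:         placeholder identifier of the old active node.
    """
    # A link flap shorter than the timer is tolerated: remain standby.
    if link_restored_after is not None and link_restored_after < promote_active_timer:
        return "STANDBY"
    # Timer expired: fence the old active first, then take over.
    fence_peer(peer_node_id)
    return "ACTIVE"


fenced = []
print(handle_peer_loss(5.0, 1.5, fenced.append, "controller1"))   # short link flap
print(handle_peer_loss(5.0, None, fenced.append, "controller1"))  # peer really gone
print(fenced)  # nodes the fencing hook was asked to reboot
```

In this model the first call stays STANDBY and fences nothing, while the second fences the peer before returning ACTIVE. The ordering (fence, then promote) is the point of the sketch: if the fencing hook cannot actually reboot the old active node, both nodes can end up active.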
