Thanks, Mathi and Anders.

In the latest split-brain scenario, I also observed that the active controller,
controller1, failed with this error message:
ER AMF director heart beat timeout, generating core for amfd
opensaf_reboot was invoked to stop and start OpenSAF on the node.

The standby controller, controller2, promoted itself to active.
However, when the old active controller, controller1, restarted, it failed to
find its peer, controller2, within 2 seconds, so controller1 also set itself to
active:
No peer available => Setting Active role for this node
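
As a rough illustration of the race described above, here is a minimal Python sketch. It is a simplified model and not OpenSAF code: the function name and parameters are hypothetical, the peer is assumed to be already active, and the 2000 ms default only mirrors the "within 2 seconds" observed here.

```python
def choose_startup_role(peer_seen_after_ms, discover_timeout_ms=2000):
    """Model the role a restarting controller takes (hypothetical helper).

    peer_seen_after_ms: time until the (already active) peer becomes
    visible, or None if it never becomes visible.
    """
    if peer_seen_after_ms is not None and peer_seen_after_ms <= discover_timeout_ms:
        # Peer discovered in time: do not claim the active role.
        return "STANDBY"
    # Corresponds to: "No peer available => Setting Active role for this node"
    return "ACTIVE"


# Peer becomes visible after 3 s, but the node only waits 2 s: two actives.
print(choose_startup_role(3000))
# Same delay with a longer discovery window: the node joins as standby.
print(choose_startup_role(3000, discover_timeout_ms=5000))
```

In this model a larger timeout avoids the second active node, at the cost that a node starting with no peer at all waits the full timeout before it becomes active.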

IMMND in both controllers threw errors:
Sep  3 07:53:36 controller1 osafimmnd[8181]: WA MDS problem-2, giving up
Sep  3 07:53:36 controller1 osafimmnd[8181]: ER IMMND - Periodic server job failed
Sep  3 07:53:36 controller1 osafimmnd[8181]: ER Failed, exiting...

Sep  3 07:53:36 controller2 osafimmnd[19218]: ER IMMND forced to restart on order from IMMD, exiting
Sep  3 07:53:36 controller2 osafimmd[19203]: WA IMMND coordinator at 20b0f apparently crashed => electing new coord
Sep  3 07:53:36 controller2 osafimmd[19203]: ER Failed to find candidate for new IMMND coordinator
Sep  3 07:53:36 controller2 osafimmd[19203]: ER Active IMMD has to restart the IMMSv. All IMMNDs will restart

Controller2 then rebooted. During the same period, 2 payload nodes were also
restarting, and they failed to start because there was no functioning
controller: controller1 was still starting up and controller2 was rebooting.

My questions are:

1. When controller1 restarted, it should have found the active controller,
controller2, but it did not. Can we use RDE_DISCOVER_PEER_TIMEOUT to give it
more time to find its peer? What would be the downsides of setting
RDE_DISCOVER_PEER_TIMEOUT to a larger value?

2. How can we avoid a scenario in which both controllers are
rebooting/restarting and there is no functioning active controller in the
cluster?

Thanks!

Shu Wang


-----Original Message-----
From: Anders Widell [mailto:[email protected]]
Sent: Thursday, September 24, 2015 6:35 AM
To: Mathivanan Naickan Palanivelu; Shu Wang
Cc: [email protected]
Subject: Re: [users] How to correct a split-brain situation

Also, I must point out the importance of having a redundant network connection 
between the nodes; otherwise it will be a single point of failure. Is your 
network duplicated?

/ Anders Widell

On 09/24/2015 12:21 PM, Mathivanan Naickan Palanivelu wrote:
> Hi,
>
> Note that FMS_PROMOTE_ACTIVE_TIMER and the opensaf_reboot script are two
> platform adaptation attributes in OpenSAF w.r.t. failover and fencing. An
> OpenSAF user can customize these in their deployments.
>
> Upon receiving connection loss indication with the active controller,
> the STANDBY controller starts this promote active timer (see 
> FMS_PROMOTE_ACTIVE_TIMER in /etc/opensaf/fmd.conf).
> This timer acts as a tolerance mechanism to handle or differentiate
> temporary link-flaps and false-positives in your network.
> Upon expiry of this timer, the STANDBY invokes opensaf_reboot script
> (with the intention to reboot the ACTIVE node) and subsequently promotes 
> itself to ACTIVE.
>
> The opensaf_reboot script is an integration point for the OpenSAF
> user. So, during failover when this opensaf_reboot script is invoked
> the node information (node_id, PLM ee name) of the peer ACTIVE node is passed 
> as input to this script.
> The user can modify this script so that it invokes 'commands'
> that will perform remote reboots of the old ACTIVE node.
> The 'commands' here could be an IPMI command or any STONITH agent/command.
>
> Cheers,
> Mathi.
>
> ----- [email protected] wrote:
>
>> When a system gets into a split-brain scenario, both controllers assume
>> the active role. How does a payload node distinguish which controller it
>> is associated with? Is there a way to find out which payload nodes
>> are connected to which controller?
>>
>> Our cluster needs to provide service 24x7, so restarting the cluster
>> is not possible when this situation occurs. What is the best way to
>> correct a split-brain situation? If we stop and restart one of the
>> controller nodes to allow it to rejoin the other controller, should
>> we also restart the payload nodes associated with that controller?
>> Those payload nodes should be stopped before stopping their
>> associated controller node, correct?
>>
>> Shu Wang
>>
>>
>>
>>
>> ________________________________
>> The information transmitted herein is intended only for the person or
>> entity to which it is addressed and may contain confidential,
>> proprietary and/or privileged material. Any review, retransmission,
>> dissemination or other use of, or taking of any action in reliance
>> upon, this information by persons or entities other than the intended
>> recipient is prohibited. If you received this in error, please
>> contact the sender and delete the material from any computer.
>> _______________________________________________
>> Opensaf-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>
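
The failover sequence Mathi describes in the quoted reply (connection loss, promote-active timer, opensaf_reboot, then promotion) can be sketched roughly as follows. This is a simplified Python model under stated assumptions, not OpenSAF code: the function, its parameters, and the `fence_peer` callable standing in for the opensaf_reboot integration point are all hypothetical, and the time values are unit-agnostic ticks rather than the actual units of FMS_PROMOTE_ACTIVE_TIMER.

```python
def standby_on_link_loss(link_restored_after, promote_active_timer, fence_peer):
    """Model the STANDBY's reaction to losing contact with the ACTIVE.

    link_restored_after: ticks until the link comes back, or None if it
        never does (hypothetical input for this model).
    promote_active_timer: tolerance window, in the same ticks.
    fence_peer: callable standing in for the opensaf_reboot hook
        (for example an IPMI or STONITH command in a real deployment).
    """
    if link_restored_after is not None and link_restored_after < promote_active_timer:
        # Temporary link flap: contact returned before the timer expired.
        return "STANDBY"
    # Timer expired: request a reboot of the old ACTIVE, then take over.
    fence_peer()
    return "ACTIVE"


fenced = []
role = standby_on_link_loss(
    link_restored_after=None,
    promote_active_timer=10,
    fence_peer=lambda: fenced.append("controller1"),
)
print(role, fenced)
```

The ordering is the point of the sketch: fencing is requested before the standby promotes itself, so that the old active node is not left running alongside the new one.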



