Comments inline:

> -----Original Message-----
> From: Shu Wang [mailto:[email protected]]
> Sent: Tuesday, September 29, 2015 3:26 AM
> To: Anders Widell; Mathivanan Naickan Palanivelu
> Cc: [email protected]
> Subject: RE: [users] How to correct a split-brain situation
> 
> Thanks, Mathi and Anders.
> 
> In the latest split brain scenario, I also observed that active controller
> controller1 failed with this error message:
> ER AMF director heart beat timeout, generating core for amfd
> opensaf_reboot was invoked to stop and start OpenSAF on the node.
> 
> Standby controller controller2 promoted itself to active controller.
> However, when the old active controller controller1 restarted, it failed to
> find its peer controller2 within 2 seconds, so controller1 set itself to
> active:
> No peer available => Setting Active role for this node
> 
> IMMND in both controllers threw errors:
> Sep  3 07:53:36 controller1 osafimmnd[8181]: WA MDS problem-2, giving up
> Sep  3 07:53:36 controller1 osafimmnd[8181]: ER IMMND - Periodic server job failed
> Sep  3 07:53:36 controller1 osafimmnd[8181]: ER Failed, exiting...
> 
> Sep  3 07:53:36 controller2 osafimmnd[19218]: ER IMMND forced to restart on order from IMMD, exiting
> Sep  3 07:53:36 controller2 osafimmd[19203]: WA IMMND coordinator at 20b0f apparently crashed => electing new coord
> Sep  3 07:53:36 controller2 osafimmd[19203]: ER Failed to find candidate for new IMMND coordinator
> Sep  3 07:53:36 controller2 osafimmd[19203]: ER Active IMMD has to restart the IMMSv. All IMMNDs will restart
> 
> Controller2 rebooted. During the same period, 2 payload nodes were also
> restarting, and they failed to start: controller1 was starting up and
> controller2 was rebooting, so there was no functioning controller.
> 
> My questions are:
> 
> 1. When controller1 restarted, it should have found the active controller
> controller2, but it did not. Can we use RDE_DISCOVER_PEER_TIMEOUT to give it
> more time to find its peer?
[Mathi]
Yes.
The discovery latency depends on your network's latency.

> What would be the cons of setting RDE_DISCOVER_PEER_TIMEOUT to a larger
> number?
>
[Mathi]
In the initial cluster startup case, cluster role determination would take
longer.
In this test case, the standby would take more time to join the cluster.
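
To illustrate the trade-off, here is a minimal sketch of role selection with a
discovery timeout. It is not the actual RDE code: the function and its
parameters are made up for illustration, and only the name
RDE_DISCOVER_PEER_TIMEOUT comes from OpenSAF.

```python
# Hypothetical model of role selection; not OpenSAF's RDE implementation.
def choose_role(peer_seen_after_s, discover_timeout_s):
    """Return the role a starting controller would take in this model.

    peer_seen_after_s: seconds until the peer becomes visible on the
        network, or None if no peer ever answers.
    discover_timeout_s: stands in for RDE_DISCOVER_PEER_TIMEOUT.
    """
    if peer_seen_after_s is not None and peer_seen_after_s <= discover_timeout_s:
        # Peer found within the window: join the cluster as standby.
        return "STANDBY"
    # No peer within the window: "No peer available => Setting Active role".
    return "ACTIVE"


# Peer needs 3 s to become visible but the window is 2 s: the restarted
# node goes active although an active peer exists (split brain).
print(choose_role(3.0, 2.0))
# Same delay with a 5 s window: the restarted node joins as standby.
print(choose_role(3.0, 5.0))
# Initial startup with no peer at all: active, but only after the window ends.
print(choose_role(None, 5.0))
```

A larger window tolerates a slower network, at the price of waiting longer
before roles are settled.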
 
> 2. How can we avoid the scenario where both controllers are
> rebooting/restarting and there is no functioning active controller in the
> cluster?
> 
[Mathi]
One option is to have an external mechanism that takes down the second
controller.
Alternatively, FM and RDE are reference implementations that the user can
enhance or modify.
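
As a rough illustration of the first option, the sketch below builds an IPMI
power-reset command for a controller. It assumes ipmitool is installed and
that each controller's BMC is reachable over the network. The node-to-BMC
table, the addresses, the user name, and the password file path are all
placeholders, not anything OpenSAF provides.

```python
import subprocess

# Hypothetical mapping from controller name to its BMC address.
BMC_ADDRESS = {
    "controller1": "192.0.2.11",
    "controller2": "192.0.2.12",
}


def fence_command(node, user="admin", password_file="/etc/fence/ipmi.pass"):
    """Build an ipmitool command that power-resets `node` via its BMC."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", BMC_ADDRESS[node],
        "-U", user,
        "-f", password_file,  # ipmitool reads the password from this file
        "chassis", "power", "reset",
    ]


def fence(node, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    cmd = fence_command(node)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True, timeout=30)
    return cmd


# Dry run: shows what would be executed to take down controller1.
fence("controller1")
```

The same call could be wired into whatever hook your deployment uses for
fencing; keeping it in dry-run mode first makes it easy to check the command
before letting it reset a node.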

Having said that, some enhancements are also planned for the next release.

Cheers,
Mathi.

> Thanks!
> 
> Shu Wang
> 
> 
> -----Original Message-----
> From: Anders Widell [mailto:[email protected]]
> Sent: Thursday, September 24, 2015 6:35 AM
> To: Mathivanan Naickan Palanivelu; Shu Wang
> Cc: [email protected]
> Subject: Re: [users] How to correct a split-brain situation
> 
> Also, I must point out the importance of having a redundant network
> connection between the nodes; otherwise it will be a single point of failure.
> Is your network duplicated?
> 
> / Anders Widell
> 
> On 09/24/2015 12:21 PM, Mathivanan Naickan Palanivelu wrote:
> > Hi,
> >
> > Note that FMS_PROMOTE_ACTIVE_TIMER and the opensaf_reboot script are two
> > platform adaptation attributes in OpenSAF w.r.t. failover and fencing. An
> > OpenSAF user can customize these in their deployments.
> >
> > Upon receiving connection loss indication with the active controller,
> > the STANDBY controller starts this promote active timer (see
> > FMS_PROMOTE_ACTIVE_TIMER in /etc/opensaf/fmd.conf).
> > This timer acts as a tolerance mechanism to handle or differentiate
> > temporary link-flaps and false-positives in your network.
> > Upon expiry of this timer, the STANDBY invokes opensaf_reboot script
> > (with the intention to reboot the ACTIVE node) and subsequently
> > promotes itself to ACTIVE.
> >
> > The opensaf_reboot script is an integration point for the OpenSAF
> > user. During failover, when this opensaf_reboot script is invoked,
> > the node information (node_id, PLM ee name) of the peer ACTIVE node is
> > passed as input to this script.
> > The user can modify this script to invoke 'commands' that perform
> > remote reboots of the old ACTIVE node.
> > The 'commands' here could be an IPMI command or any STONITH
> > agent/command.
> >
> > Cheers,
> > Mathi.
> >
> > ----- [email protected] wrote:
> >
> >> When a system gets into split-brain scenario, both controllers assume
> >> active role. How does a payload node distinguish which controller it
> >> is associated to? Is there a way that we find out which payload nodes
> >> connect to which controller?
> >>
> >> Our cluster needs to provide service 24x7.  So restarting the cluster
> >> is not possible when this situation occurs.  What is the best way to
> >> correct a split-brain situation? If we stop and restart one of the
> >> controller nodes to allow it to rejoin the other controller, should
> >> we also restart the payload nodes associated to that controller?
> >> Those payload nodes should be stopped before stopping their
> >> associated controller node, correct?
> >>
> >> Shu Wang
> >>
> >>
> >>
> >>
> >> ________________________________
> >> The information transmitted herein is intended only for the person or
> >> entity to which it is addressed and may contain confidential,
> >> proprietary and/or privileged material. Any review, retransmission,
> >> dissemination or other use of, or taking of any action in reliance
> >> upon, this information by persons or entities other than the intended
> >> recipient is prohibited. If you received this in error, please
> >> contact the sender and delete the material from any computer.
> >> _______________________________________________
> >> Opensaf-users mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/opensaf-users
> >
