Hi,

We are using OpenSAF 4.4.0. We have a cluster with 2 controllers and 3 payloads. We had a situation where osafamfnd was killed on the active controller. In that same second, the standby controller issued the following warnings in its messages log:

Jun 19 08:29:06 vervet osafimmd[25698]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Jun 19 08:29:06 vervet osafimmnd[25713]: WA DISCARD DUPLICATE FEVS message:10756
Jun 19 08:29:06 vervet osafimmnd[25713]: WA Error code 2 returned for message type 57 - ignoring
Jun 19 08:29:06 vervet osafimmd[25698]: WA IMMND DOWN on active controller f2 detected at standby immd!! f1. Possible failover
Jun 19 08:29:06 vervet osafimmnd[25713]: WA DISCARD DUPLICATE FEVS message:10757
Jun 19 08:29:06 vervet osafimmnd[25713]: WA Error code 2 returned for message type 57 - ignoring
Jun 19 08:29:06 vervet osafimmd[25698]: NO Skipping re-send of fevs message 10756 since it has recently been resent.
Jun 19 08:29:06 vervet osafimmd[25698]: NO Skipping re-send of fevs message 10757 since it has recently been resent.
Jun 19 08:29:06 vervet osafimmnd[25713]: NO Global discard node received for nodeId:2020f pid:9609
Jun 19 08:29:06 vervet osafimmnd[25713]: NO Implementer disconnected 9 <0, 2020f(down)> (safLckService)
Jun 19 08:29:06 vervet osafimmnd[25713]: NO Implementer disconnected 8 <0, 2020f(down)> (safEvtService)
Jun 19 08:29:06 vervet osafimmnd[25713]: NO Implementer disconnected 6 <0, 2020f(down)> (safMsgGrpService)
After those messages, the standby controller and the 3 payloads stayed up for another 3 minutes, and then the following errors appeared in the standby controller's messages log:

Jun 19 08:32:06 vervet osafamfnd[25808]: ER AMF director unexpectedly crashed
Jun 19 08:32:06 vervet osafimmnd[25713]: NO No IMMD service => cluster restart, exiting
Jun 19 08:32:06 vervet osafamfnd[25808]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131343, SupervisionTime = 0
Jun 19 08:32:06 vervet osafamfd[25790]: NO Re-initializing with IMM
Jun 19 08:32:06 vervet opensaf_bounce: Bouncing local node; timeout=
Jun 19 08:32:06 vervet opensafd: Stopping OpenSAF Services
Jun 19 08:33:02 vervet osafamfwd[25932]: TIMEOUT receiving AMF health check request, generating core for amfnd
Jun 19 08:33:02 vervet osafamfwd[25932]: Last received healthcheck cnt=105 at Fri Jun 19 08:32:02 2015
Jun 19 08:33:02 vervet osafamfwd[25932]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: AMFND unresponsive, AMFWDOG initiated system reboot, OwnNodeId = 131343, SupervisionTime = 0

During this time, osafdtmd and osafamfwd stayed up on the original active controller. At this point the standby controller and all payloads rebooted.

So my questions are:

1) Why was there a 3-minute delay before anything happened?
2) Why didn't the standby immediately take the active role and keep itself and the payloads up? I.e., why did the standby and the payloads reboot?
3) A more basic question: how exactly does the standby node know that the active is dead and that it should take the active role?

I would greatly appreciate any help with these questions.

thanks
