Hi folks,

Running a system with 2 controllers and 20+ payload cards on OpenSAF 4.3, I am getting controller reboots that look similar to tickets #946/#955 ("safmsgnd: On payloads restart saImmOiImplementerSet FAILED with 14"). Log excerpt from the rebooting controller:
2014-07-24T15:33:56.793734+00:00 scm2 osafimmnd[2293]: WA ERR_BAD_HANDLE: Handle use is blocked by pending reply on syncronous call
2014-07-24T15:33:56.793811+00:00 scm2 osafimmnd[2293]: NO Implementer locally disconnected. Marking it as doomed 3 <18, 1100f> (safAmfService)
2014-07-24T15:33:56.794381+00:00 scm2 osafamfd[2462]: WA saImmOiRtObjectUpdate of 'safSu=SCM2,safSg=2N,safApp=OpenSAF' saAmfSURestartCount failed with 9
2014-07-24T15:33:56.794846+00:00 scm2 osafimmnd[2293]: WA ERR_BAD_HANDLE: Client 77309480975 not found in server
2014-07-24T15:33:56.795427+00:00 scm2 osafamfd[2462]: WA saImmOiRtObjectUpdate of 'safSu=SCM2,safSg=2N,safApp=OpenSAF' saAmfSUNumCurrStandbySIs failed with 9
2014-07-24T15:33:56.795881+00:00 scm2 osafimmnd[2293]: WA ERR_BAD_HANDLE: Client 77309480975 not found in server
2014-07-24T15:33:56.796381+00:00 scm2 osafamfd[2462]: WA saImmOiRtObjectUpdate of 'safSu=SCM2,safSg=2N,safApp=OpenSAF' saAmfSUNumCurrActiveSIs failed with 9
2014-07-24T15:33:56.797709+00:00 scm2 osafimmnd[2293]: WA ERR_BAD_HANDLE: Client 77309480975 not found in server
2014-07-24T15:33:56.829827+00:00 scm2 osafamfd[2462]: NO Re-initializing with IMM
2014-07-24T15:33:56.830303+00:00 scm2 osafimmnd[2293]: WA IMMND - Client Node Get Failed for cli_hdl 77309480975
2014-07-24T15:33:56.845047+00:00 scm2 osafamfd[2462]: ER saImmOiImplementerSet failed 14
2014-07-24T15:33:56.845157+00:00 scm2 osafamfd[2462]: ER exiting since avd_imm_impl_set failed
2014-07-24T15:33:56.853047+00:00 scm2 osafamfnd[3093]: ER AMF director unexpectedly crashed
2014-07-24T15:33:56.853120+00:00 scm2 osafamfnd[3093]: Rebooting OpenSAF NodeId = 69647 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 69647, SupervisionTime = 0

This doesn't happen frequently, but it is triggered by a shutdown/power cycle of a payload card. In pushing the problem around, I found that lowering the TIPC timeout (normally set to 10 seconds to account for our network behavior) down to 4-5 seconds avoids the failure, or at least reduces the likelihood to the point that it did not happen despite my attempts to reproduce it (see the P.S. below for the exact commands I used).

I am not yet familiar enough with the code to have a handle on the details, but is it reasonable to believe that the shorter TIPC timeout forces a cleaner detection of, and recovery from, the communication loss, thus avoiding the error? If that is the case, might it be reasonable to increase the synchronous message timeout as a workaround until there's a proper fix?

Thanks for your insight,
-andy
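P.S. For concreteness, this is roughly how I have been adjusting the TIPC failure-detection timeout (link tolerance) between runs. The link name below is a placeholder for our setup, and I'm using the iproute2 `tipc` tool; older tipcutils installs expose the same knob through tipc-config with a different syntax, so check the help output of whichever tool you have:

    # List the TIPC links on this node to find the link name.
    tipc link list

    # Show the current link tolerance (milliseconds). <link-name> is a
    # placeholder; substitute a name from the listing above.
    tipc link get tolerance link <link-name>

    # Lower the tolerance from our usual 10000 ms to 4000 ms; with this
    # setting the failure has not reproduced in my attempts.
    tipc link set tolerance 4000 link <link-name>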
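P.P.S. On the synchronous message timeout question: if I am reading the IMM agent code correctly, the blocking-call timeout in the saImm* library can be overridden per process via the IMMA_SYNCR_TIMEOUT environment variable. I believe the value is in units of 10 ms, so 2000 would mean 20 seconds, but please correct me if I have the unit or the variable wrong. Something like the following in the service environment is what I had in mind as a workaround:

    # Assumption: IMMA_SYNCR_TIMEOUT is read by the IMM agent library at
    # initialization and is expressed in units of 10 ms (2000 => 20 s).
    # On our system this would go in the environment of the affected
    # OpenSAF service, e.g. via its conf file under /etc/opensaf/.
    export IMMA_SYNCR_TIMEOUT=2000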
