Hi,

Please share the complete syslog of both controllers.
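For lining up events across the two controllers, a minimal sketch follows. It assumes each syslog carries the ISO 8601 timestamp prefix seen in the excerpt quoted below and is already in time order; the function names and the file names in the usage note are hypothetical and not part of OpenSAF.

```python
import heapq
from datetime import datetime


def parse_line(line):
    """Split one syslog line into (timestamp, remainder).

    Assumes the line starts with an ISO 8601 timestamp such as
    2014-07-24T15:33:56.793734+00:00 followed by a space.
    Lines without such a prefix raise ValueError.
    """
    stamp, _, rest = line.rstrip("\n").partition(" ")
    return datetime.fromisoformat(stamp), rest


def parsed_entries(lines):
    """Yield (timestamp, remainder) for every non-blank line."""
    for line in lines:
        if line.strip():
            yield parse_line(line)


def merge_syslogs(*logs):
    """Merge several time-ordered syslogs into one time-ordered stream.

    Each argument is an iterable of lines (an open file works).
    Only the timestamp is used as the sort key.
    """
    streams = [parsed_entries(log) for log in logs]
    return heapq.merge(*streams, key=lambda entry: entry[0])
```

With the two controller logs saved as, say, `scm1.log` and `scm2.log` (hypothetical names), `merge_syslogs(open("scm1.log"), open("scm2.log"))` yields the entries of both files in timestamp order, which makes it easier to see what the other controller logged around the moment `osafamfd` exited.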
What is the TIPC tolerance timeout presently on the setup?

/Neel

On Friday 25 July 2014 11:29 PM, Andrew Riley wrote:
> Hi Folks,
>
> Running a system with 2 controllers and 20+ payload cards on 4.3, I am
> getting controller reboots that look similar to tickets 946/955 (safmsgnd: On
> payloads restart saImmOiImplementerSet FAILED with 14)
>
> 2014-07-24T15:33:56.793734+00:00 scm2 osafimmnd[2293]: WA ERR_BAD_HANDLE: Handle use is blocked by pending reply on syncronous call
> 2014-07-24T15:33:56.793811+00:00 scm2 osafimmnd[2293]: NO Implementer locally disconnected. Marking it as doomed 3 <18, 1100f> (safAmfService)
> 2014-07-24T15:33:56.794381+00:00 scm2 osafamfd[2462]: WA saImmOiRtObjectUpdate of 'safSu=SCM2,safSg=2N,safApp=OpenSAF' saAmfSURestartCount failed with 9
> 2014-07-24T15:33:56.794846+00:00 scm2 osafimmnd[2293]: WA ERR_BAD_HANDLE: Client 77309480975 not found in server
> 2014-07-24T15:33:56.795427+00:00 scm2 osafamfd[2462]: WA saImmOiRtObjectUpdate of 'safSu=SCM2,safSg=2N,safApp=OpenSAF' saAmfSUNumCurrStandbySIs failed with 9
> 2014-07-24T15:33:56.795881+00:00 scm2 osafimmnd[2293]: WA ERR_BAD_HANDLE: Client 77309480975 not found in server
> 2014-07-24T15:33:56.796381+00:00 scm2 osafamfd[2462]: WA saImmOiRtObjectUpdate of 'safSu=SCM2,safSg=2N,safApp=OpenSAF' saAmfSUNumCurrActiveSIs failed with 9
> 2014-07-24T15:33:56.797709+00:00 scm2 osafimmnd[2293]: WA ERR_BAD_HANDLE: Client 77309480975 not found in server
> 2014-07-24T15:33:56.829827+00:00 scm2 osafamfd[2462]: NO Re-initializing with IMM
> 2014-07-24T15:33:56.830303+00:00 scm2 osafimmnd[2293]: WA IMMND - Client Node Get Failed for cli_hdl 77309480975
> 2014-07-24T15:33:56.845047+00:00 scm2 osafamfd[2462]: ER saImmOiImplementerSet failed 14
> 2014-07-24T15:33:56.845157+00:00 scm2 osafamfd[2462]: ER exiting since avd_imm_impl_set failed
> 2014-07-24T15:33:56.853047+00:00 scm2 osafamfnd[3093]: ER AMF director unexpectedly crashed
> 2014-07-24T15:33:56.853120+00:00 scm2 osafamfnd[3093]: Rebooting OpenSAF NodeId = 69647 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 69647, SupervisionTime = 0
>
> It doesn't happen frequently, but it is triggered by a shutdown/power cycle
> of a payload card.
>
> In pushing the problem around, I found that lowering the TIPC timeout
> (normally set to 10 seconds to account for our network behavior) down to
> 4-5 seconds either avoids the failure or reduces its likelihood to the point
> that it didn't happen despite my attempts.
>
> I am not yet familiar enough with the code to have a handle on the details,
> but is it reasonable to believe the TIPC timeout forces a cleaner
> detection/recovery from the communication loss, thus avoiding the error? If
> that is the case, might it be reasonable to increase the synchronous message
> timeout as a workaround until there's a solution?
>
> Thanks for your insight,
> -andy
>
> _______________________________________________
> Opensaf-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-users
