Hi Minh,

OK, so the CLM callback can arrive either before or after AMFND down.

In case AMFND down comes first and then the CLM callback, the current flow works correctly.

In case the CLM callback comes first and then AMFND down:
- AMFND was already up before: the current flow does clm_node_exit_complete(), so it still works correctly.
- AMFND was not yet up before (node rebooting up / coming up for the first time): this is our issue.
  + Currently we only delete the node without taking care of the msg_id counters.
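To make the mismatch concrete, here is a small standalone illustration (simplified, with hypothetical types; not the real AMFD/AMFND code): AMFD keeps counting from its old snd_msg_id while the rebooted AMFND expects the sequence to start again from 1, so the first message after the node comes back is rejected and AMFND orders the node to reboot again.

// Simplified, standalone illustration of the message ID mismatch
// (hypothetical types; not the real AMFD/AMFND code or structures).
#include <cstdint>
#include <iostream>

struct Director {                 // plays the role of AMFD for one node
  std::uint32_t snd_msg_id = 0;   // last message id sent to that node
  std::uint32_t next_id() { return ++snd_msg_id; }
};

struct NodeDirector {             // plays the role of AMFND on the node
  std::uint32_t expected_id = 1;  // a freshly started AMFND expects id 1
  void receive(std::uint32_t id) {
    if (id != expected_id) {
      std::cout << "mismatch: got " << id << ", expected " << expected_id
                << " -> node orders a reboot of itself\n";
      return;
    }
    std::cout << "ok: " << id << "\n";
    ++expected_id;
  }
};

int main() {
  Director amfd;
  NodeDirector amfnd;

  // Normal traffic before the node goes down: ids 1, 2, 3 are accepted.
  for (int i = 0; i < 3; ++i) amfnd.receive(amfd.next_id());

  // Node reboots: AMFND starts from scratch, but the CLM-callback path
  // only deleted the node and never reset the director's snd_msg_id.
  amfnd = NodeDirector{};

  // First message after the node comes back is id 4 instead of 1.
  amfnd.receive(amfd.next_id());
  return 0;
}

Resetting rcv_msg_id and snd_msg_id when the node is deleted in the CLM callback (as in the patch quoted below) avoids this stale-counter path.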
Best Regards,
Thuan

-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Monday, July 9, 2018 8:52 AM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]

This problem can also happen in a normal situation, not only when "AMFND is not yet up", so treating it as a race also fixes the problem in the "AMFND already up" case.

On 09/07/18 11:41, Minh Hon Chau wrote:
> Hi Thuan,
>
> I often see mds down coming before the clm callback in my tests; the
> problem in this ticket has those events coming in the reverse order.
>
> <143>1 2018-07-09T11:19:22.185279+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23028"] 372:amf/amfd/ndfsm.cc:772 >> avd_mds_avnd_down_evh: 2030f, 0x55895024e410
> <143>1 2018-07-09T11:19:22.185303+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23029"] 372:amf/amfd/ndproc.cc:1219 >> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
> <143>1 2018-07-09T11:19:22.185324+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23030"] 372:amf/amfd/ndfsm.cc:1111 >> avd_node_mark_absent
> ...
> <143>1 2018-07-09T11:19:22.253699+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23890"] 372:amf/amfd/ndfsm.cc:821 << avd_mds_avnd_down_evh
>
> <143>1 2018-07-09T11:19:22.272188+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23917"] 372:clm/agent/clma_api.cc:786 >> saClmDispatch
> ...
> <143>1 2018-07-09T11:19:22.272268+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23926"] 372:amf/amfd/clm.cc:223 >> clm_track_cb: '0' '4' '1'
> <143>1 2018-07-09T11:19:22.272277+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23927"] 372:amf/amfd/clm.cc:239 TR numberOfMembers:'4', numberOfItems:'1'
> <143>1 2018-07-09T11:19:22.272285+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23928"] 372:amf/amfd/clm.cc:245 TR i = 0, node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3
> <142>1 2018-07-09T11:19:22.272294+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23929"] 372:amf/amfd/clm.cc:320 IN clm_track_cb: CLM node 'safNode=PL-3,safCluster=myClmCluster' is not an AMF cluster member; MDS down received
> <143>1 2018-07-09T11:19:22.272303+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23930"] 372:amf/amfd/clm.cc:443 << clm_track_cb
> ...
> <143>1 2018-07-09T11:19:22.272352+10:00 SC-1 osafamfd 372 osafamfd [meta sequenceId="23936"] 372:clm/agent/clma_api.cc:834 << saClmDispatch
>
> Resetting the counter can avoid the msg mismatch, but I'm not sure it
> is a complete fix, since the mds down event is still not handled
> properly. If the mds down event were handled properly in both cases,
> would this msg mismatch problem still exist?
>
> Thanks,
> Minh
>
> On 09/07/18 11:11, Tran Thuan wrote:
>> Hi Minh,
>>
>> If I understand correctly, the CLM callback always comes before the
>> AMFND down/up event.
>> The problem is just that AMFND is not yet up when the CLM callback
>> receives the MDS down of the node and deletes the node.
>> After that, the AMFND down event cannot find the node; not finding
>> the node is expected, I guess.
>> So we only need to reset the counters to make things right. What do
>> you think?
>>
>> <143>1 2018-07-03T18:40:27.239707+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2448"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb: '0' '4' '1'
>> <142>1 2018-07-03T18:40:27.239778+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2451"] 4815:amf/amfd/clm.cc:299 IN clm_track_cb: CLM node 'safNode=PL-2-3,safCluster=myClmCluster' is not an AMF cluster member; MDS down received
>> <143>1 2018-07-03T18:40:27.239792+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2452"] 4815:amf/amfd/clm.cc:418 << clm_track_cb
>> <143>1 2018-07-03T18:40:27.243287+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2465"] 4815:amf/amfd/ndfsm.cc:772 >> avd_mds_avnd_down_evh: 2030f, (nil)
>> <143>1 2018-07-03T18:40:27.243307+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2466"] 4815:amf/amfd/ndfsm.cc:821 << avd_mds_avnd_down_evh
>>
>> <143>1 2018-07-03T18:41:37.965003+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2596"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb: '0' '4' '1'
>> <143>1 2018-07-03T18:41:37.965104+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2605"] 4815:amf/amfd/clm.cc:418 << clm_track_cb
>> <143>1 2018-07-03T18:41:45.016379+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2631"] 4815:amf/amfd/ndfsm.cc:750 TR amfnd on 2030f is up
>> <143>1 2018-07-03T18:41:45.139262+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="2686"] 4815:amf/amfd/ndfsm.cc:772 >> avd_mds_avnd_down_evh: 2030f, 0x7ff9ac3dc940
>> <143>1 2018-07-03T18:41:45.141883+10:00 SC-2-1 osafamfd 4815 osafamfd [meta sequenceId="3008"] 4815:amf/amfd/ndfsm.cc:821 << avd_mds_avnd_down_evh
>>
>> Best Regards,
>> Thuan
>>
>> -----Original Message-----
>> From: Minh Hon Chau <minh.c...@dektech.com.au>
>> Sent: Monday, July 9, 2018 7:15 AM
>> To: thuan.tran <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; hans.nordeb...@ericsson.com; gary....@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]
>>
>> Hi,
>>
>> In a normal situation, when I stop OpenSAF or reboot the node, I
>> often get MDS down first and then the CLM track callback. [1]
>>
>> The problem in this ticket is that AMFD gets the CLM track callback
>> before MDS down. [2]
>>
>> We need to see it as a race condition, not as a missing counter
>> reset, so that amfd's function for handling MDS down is still called.
>> That would be equivalent to [1], which currently works.
>>
>> With this patch we still do not get avd_mds_avnd_down_evh() called,
>> and that function processes many things in the event of mds down
>> (i.e. @nodes_exit_cnt, ...).
>>
>> I think node_info.member should be set to TRUE or FALSE in the CLM
>> callback, since it represents CLM membership. We can use both @member
>> and one of the variables indicating MDS down to trap each other in
>> the MDS down and CLM callback events, so that in the end
>> avd_mds_avnd_down_evh() runs and the nodeid is removed from
>> node_id_db in both [1] and [2]. (A rough sketch of this idea follows
>> after the quoted patch at the end of this thread.)
>>
>> Thanks,
>> Minh
>>
>> On 06/07/18 20:36, thuan.tran wrote:
>>> There is a case where, after AMFD sends a reboot order due to an
>>> “out of sync window”, AMFD receives the CLM track callback while the
>>> node is not yet a member and deletes the node.
>>> The later AMFND MDS down then does not reset the msg_id counters,
>>> since it cannot find the node.
>>> When the node reboots up, AMFD continues using the current msg_id
>>> counter when sending to AMFND, which causes a message ID mismatch in
>>> AMFND, and AMFND then orders a reboot of its own node.
>>> ---
>>>  src/amf/amfd/clm.cc | 7 ++++++-
>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
>>> index e113a65f9..4a15d5ad7 100644
>>> --- a/src/amf/amfd/clm.cc
>>> +++ b/src/amf/amfd/clm.cc
>>> @@ -316,9 +316,14 @@ static void clm_track_cb(
>>>                __FUNCTION__, node_name.c_str());
>>>         goto done;
>>>       } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
>>> -       LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
>>> +       LOG_IN("%s: CLM node '%s' is ABSENT; MDS down received",
>>>                __FUNCTION__, node_name.c_str());
>>>         avd_node_delete_nodeid(node);
>>> +       /* Reset msg_id because AVND MDS down may come later
>>> +          and cannot find node to reset these, cause message ID mismatch. */
>>> +       node->rcv_msg_id = 0;
>>> +       node->snd_msg_id = 0;
>>> +       node->node_info.member = SA_FALSE;
>>>         goto done;
>>>       }
>>>       TRACE(" Node Left: rootCauseEntity %s for node %u",
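To illustrate the alternative Minh describes above (letting node_info.member and an MDS-down indicator trap each other so that the full MDS-down handling runs for both event orders), here is a minimal standalone sketch; the flag and function names below are hypothetical, not the actual AVD_AVND fields or AMFD handlers:

// Minimal sketch of the "trap each other" idea from the discussion above
// (hypothetical flag and function names; not the actual AVD_AVND fields):
// whichever of the two events arrives second performs the full
// node-departure cleanup, so avd_mds_avnd_down_evh()-style processing and
// the node_id_db removal happen for both orders [1] and [2].
#include <iostream>

struct NodeTracker {
  bool clm_member = true;  // cleared when the CLM callback reports the node left
  bool mds_up = true;      // cleared when AMFND MDS down is seen

  void full_cleanup() {
    // Placeholder for the real work on a confirmed node departure:
    // failover handling, msg_id counter reset, removal from node_id_db, ...
    std::cout << "full cleanup runs exactly once\n";
  }

  void on_clm_node_left() {
    clm_member = false;
    if (!mds_up) full_cleanup();  // MDS down was already seen
  }

  void on_mds_down() {
    mds_up = false;
    if (!clm_member) full_cleanup();  // CLM exit was already seen
  }
};

int main() {
  NodeTracker a;  // order [1]: MDS down first, then CLM callback
  a.on_mds_down();
  a.on_clm_node_left();

  NodeTracker b;  // order [2]: CLM callback first, then MDS down
  b.on_clm_node_left();
  b.on_mds_down();
  return 0;
}

Either way, the point is that the cleanup (including the counter reset) is driven by the pair of events rather than by whichever one happens to arrive first.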