Hi Minh,

OK, so the CLM callback can come either before or after AMFND down.
In case AMFND down comes first and then the CLM callback: the current flow works correctly.
In case the CLM callback comes first and then AMFND down:
- AMFND was already up before: the current flow does clm_node_exit_complete(), which still
works correctly.
- AMFND was not yet up before (node rebooting / coming up for the first time): this is our issue.
  + The current code only deletes the node without taking care of the msg_id counters
    (see the sketch below).
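
For reference, a minimal sketch of what I mean (same fields and calls as in the
patch at the end of this thread, with the surrounding clm_track_cb() context
abbreviated):

    } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
      LOG_IN("%s: CLM node '%s' is ABSENT; MDS down received",
             __FUNCTION__, node_name.c_str());
      avd_node_delete_nodeid(node);
      /* AMFND never came up on this node, so the later AVND MDS down event
         cannot find the node and will not reset these counters. Clear them
         here so that when the node comes back up AMFD starts again from
         msg_id 0; otherwise AMFND sees a message ID mismatch and reboots
         itself. */
      node->rcv_msg_id = 0;
      node->snd_msg_id = 0;
      node->node_info.member = SA_FALSE;
      goto done;
    }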

Best Regards,
Thuan

-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au> 
Sent: Monday, July 9, 2018 8:52 AM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; 
hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message 
ID mismatch [#2891]

This problem can possibly happen in a normal situation, not only when "AMFND is
not yet up", so treating it as a race also fixes the problem in the case of
"AMFND already up".


On 09/07/18 11:41, Minh Hon Chau wrote:
> Hi Thuan,
>
> I often see the MDS down coming before the CLM callback in my tests; the
> problem in this ticket has those events coming in the reverse order.
>
> <143>1 2018-07-09T11:19:22.185279+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23028"] 372:amf/amfd/ndfsm.cc:772 >>
> avd_mds_avnd_down_evh: 2030f, 0x55895024e410
> <143>1 2018-07-09T11:19:22.185303+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23029"] 372:amf/amfd/ndproc.cc:1219 >>
> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
> <143>1 2018-07-09T11:19:22.185324+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23030"] 372:amf/amfd/ndfsm.cc:1111 >> 
> avd_node_mark_absent ...
> <143>1 2018-07-09T11:19:22.253699+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23890"] 372:amf/amfd/ndfsm.cc:821 << 
> avd_mds_avnd_down_evh
>
> <143>1 2018-07-09T11:19:22.272188+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23917"] 372:clm/agent/clma_api.cc:786 >> 
> saClmDispatch ...
> <143>1 2018-07-09T11:19:22.272268+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23926"] 372:amf/amfd/clm.cc:223 >> clm_track_cb: '0'
> '4' '1'
> <143>1 2018-07-09T11:19:22.272277+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23927"] 372:amf/amfd/clm.cc:239 TR 
> numberOfMembers:'4', numberOfItems:'1'
> <143>1 2018-07-09T11:19:22.272285+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23928"] 372:amf/amfd/clm.cc:245 TR i = 0, 
> node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3
> <142>1 2018-07-09T11:19:22.272294+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23929"] 372:amf/amfd/clm.cc:320 IN clm_track_cb: CLM 
> node 'safNode=PL-3,safCluster=myClmCluster' is not an AMF cluster 
> member; MDS down received
> <143>1 2018-07-09T11:19:22.272303+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23930"] 372:amf/amfd/clm.cc:443 << clm_track_cb ...
> <143>1 2018-07-09T11:19:22.272352+10:00 SC-1 osafamfd 372 osafamfd 
> [meta sequenceId="23936"] 372:clm/agent/clma_api.cc:834 << 
> saClmDispatch
>
>
> Resetting the counters can avoid the message ID mismatch, but I'm not sure it
> is completely fixed, since the MDS down event is still not handled properly.
> If you had the MDS down event handled properly in both cases, would there
> still be this msg mismatch problem?
>
> Thanks,
> Minh
>
> On 09/07/18 11:11, Tran Thuan wrote:
>> Hi Minh,
>>
>> If I understand correctly, the CLM callback always comes before the AMFND
>> down/up event.
>> The problem is just that AMFND is not yet up when the CLM callback receives
>> the MDS down of the node and deletes the node.
>> After that, the AMFND down event comes and cannot find the node; not finding
>> the node is expected, I guess.
>> So we only need to reset the counters to make things right, what do you
>> think?
>>
>> <143>1 2018-07-03T18:40:27.239707+10:00 SC-2-1 osafamfd 4815 osafamfd 
>> [meta sequenceId="2448"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb:
>> '0' '4' '1'
>> <142>1 2018-07-03T18:40:27.239778+10:00 SC-2-1 osafamfd 4815 osafamfd 
>> [meta sequenceId="2451"] 4815:amf/amfd/clm.cc:299 IN clm_track_cb:
>> CLM node 'safNode=PL-2-3,safCluster=myClmCluster' is not an AMF 
>> cluster member; MDS down received
>> <143>1 2018-07-03T18:40:27.239792+10:00 SC-2-1 osafamfd 4815 osafamfd 
>> [meta sequenceId="2452"] 4815:amf/amfd/clm.cc:418 << clm_track_cb
>> <143>1 2018-07-03T18:40:27.243287+10:00 SC-2-1 osafamfd 4815 osafamfd 
>> [meta sequenceId="2465"] 4815:amf/amfd/ndfsm.cc:772 >>
>> avd_mds_avnd_down_evh: 2030f, (nil)
>> <143>1 2018-07-03T18:40:27.243307+10:00 SC-2-1 osafamfd 4815 osafamfd 
>> [meta sequenceId="2466"] 4815:amf/amfd/ndfsm.cc:821 << 
>> avd_mds_avnd_down_evh
>>
>> <143>1 2018-07-03T18:41:37.965003+10:00 SC-2-1 osafamfd 4815 osafamfd 
>> [meta sequenceId="2596"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb:
>> '0' '4' '1'
>> <143>1 2018-07-03T18:41:37.965104+10:00 SC-2-1 osafamfd 4815 osafamfd 
>> [meta sequenceId="2605"] 4815:amf/amfd/clm.cc:418 << clm_track_cb
>> <143>1 2018-07-03T18:41:45.016379+10:00 SC-2-1 osafamfd 4815 osafamfd 
>> [meta sequenceId="2631"] 4815:amf/amfd/ndfsm.cc:750 TR amfnd on 2030f 
>> is up
>> <143>1 2018-07-03T18:41:45.139262+10:00 SC-2-1 osafamfd 4815 osafamfd 
>> [meta sequenceId="2686"] 4815:amf/amfd/ndfsm.cc:772 >>
>> avd_mds_avnd_down_evh: 2030f, 0x7ff9ac3dc940
>> <143>1 2018-07-03T18:41:45.141883+10:00 SC-2-1 osafamfd 4815 osafamfd 
>> [meta sequenceId="3008"] 4815:amf/amfd/ndfsm.cc:821 << 
>> avd_mds_avnd_down_evh
>>
>> Best Regards,
>> Thuan
>>
>>
>> -----Original Message-----
>> From: Minh Hon Chau <minh.c...@dektech.com.au>
>> Sent: Monday, July 9, 2018 7:15 AM
>> To: thuan.tran <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; 
>> hans.nordeb...@ericsson.com; gary....@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to 
>> avoid message ID mismatch [#2891]
>>
>> Hi,
>>
>> In a normal situation, when I stop Opensaf or reboot the node, I often get
>> the MDS down first, then the CLM track callback. [1]
>>
>> The problem in this ticket is that AMFD gets the CLM track callback coming
>> before the MDS down. [2]
>>
>> We need to see it as a race condition, not as a missing counter reset, so
>> that amfd's function for dealing with MDS down still gets called. That
>> would make [2] equivalent to [1], which is currently working.
>>
>> With this patch we still don't get avd_mds_avnd_down_evh() called; that
>> function processes many things in the event of MDS down
>> (e.g. @nodes_exit_cnt, ...).
>>
>> I think node_info.member should be set to TRUE or FALSE in the CLM callback,
>> since it represents CLM membership. We can use both @member and one of the
>> variables indicating MDS down to trap each other in the MDS down and CLM
>> callback events, so that in the end avd_mds_avnd_down_evh() is called and
>> the nodeid is removed from node_id_db in both [1] and [2] (rough sketch
>> below).
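>>
>> Roughly something like this (just a sketch of the "trap each other" idea;
>> "mds_down" is a hypothetical flag name here, any existing variable that
>> already marks the MDS down could be used instead):
>>
>>    /* CLM track callback, node left the CLM membership: */
>>    node->node_info.member = SA_FALSE;
>>    if (node->mds_down) {
>>      /* [1] MDS down already handled: only the nodeid removal is left. */
>>      avd_node_delete_nodeid(node);
>>    }
>>
>>    /* MDS down event for the AMFND on that node: */
>>    node->mds_down = true;
>>    if (node->node_info.member == SA_FALSE) {
>>      /* [2] CLM already reported the node as left: run the normal MDS down
>>         processing (avd_mds_avnd_down_evh() path), which updates
>>         @nodes_exit_cnt etc., resets the msg_id counters and removes the
>>         nodeid from node_id_db. */
>>    }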
>>
>> Thanks,
>> Minh
>> On 06/07/18 20:36, thuan.tran wrote:
>>> There is a case where, after AMFD sends a reboot order due to “out of
>>> sync window”, AMFD receives the CLM track callback while the node is not
>>> a member yet, and deletes the node.
>>> The later AMFND MDS down will then not reset the msg_id counters, since
>>> it cannot find the node.
>>> When the node comes back up, AMFD continues using the current msg_id
>>> counter when sending to AMFND, causing a message ID mismatch in AMFND,
>>> and then AMFND orders a reboot of its own node.
>>> ---
>>>    src/amf/amfd/clm.cc | 7 ++++++-
>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
>>> index e113a65f9..4a15d5ad7 100644
>>> --- a/src/amf/amfd/clm.cc
>>> +++ b/src/amf/amfd/clm.cc
>>> @@ -316,9 +316,14 @@ static void clm_track_cb(
>>>                       __FUNCTION__, node_name.c_str());
>>>                goto done;
>>>              } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
>>> -            LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
>>> +            LOG_IN("%s: CLM node '%s' is ABSENT; MDS down received",
>>>                       __FUNCTION__, node_name.c_str());
>>>                avd_node_delete_nodeid(node);
>>> +            /* Reset msg_id because AVND MDS down may come later
>>> +               and cannot find node to reset these, cause message ID mismatch. */
>>> +            node->rcv_msg_id = 0;
>>> +            node->snd_msg_id = 0;
>>> +            node->node_info.member = SA_FALSE;
>>>                goto done;
>>>              }
>>>              TRACE(" Node Left: rootCauseEntity %s for node %u",
>>
>>
>


