Re: [devel] [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]

Tran Thuan Sun, 08 Jul 2018 23:33:43 -0700

Hi Minh,

Since AMFND not yet up before, other fields are not updated yet.
I think only 3 fields is changed in this scenario, that's why my is reset them 
only.
Do you think so?


Best Regards,
Thuan 

-----Original Message-----
From: Minh Hon Chau <minh.c...@dektech.com.au> 
Sent: Monday, July 9, 2018 1:22 PM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; 
hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message 
ID mismatch [#2891]

Hi,

Ok, so we can ignore the "AMFND already up" case.

"- AMFND not yet up before (node reboot up/first up): this is our issue.
   + current only delete node without care msg_id counter."

Are you able to reproduce this problem?
You only reset the msg_id counter of a node, how about the others: 
adest, recvr_fail_sw, node_info.initialViewNumber, nodes_exit_cnt,.. 
they all are checked in the other places.
Also, when the node goes down, the standby resets all of above variables 
directly, while the active doesn't as of this issue or only resets msg_id 
counters within your patch. The active and standby may not be consistent.

Another thing you may consider, if any new/old attributes/variables of AVD_AVND 
is going to be added/removed, we only look at one place - event of MDS down, 
don't have to look at/reset those ones in the CLM callback.

Thanks
Minh

On 09/07/18 12:12, Tran Thuan wrote:
> Hi Minh,
>
> OK, so CLM callback can handle before/after AMFND down.
> In case AMFND down then CLM callback, current flow works correctly.
> In case CLM callback then AMFND down:
> - AMFND already up before: current flow will do clm_node_exit_complete(), 
> still works correctly.
> - AMFND not yet up before (node reboot up/first up): this is our issue.
>    + current only delete node without care msg_id counter.
>
> Best Regards,
> Thuan
>
> -----Original Message-----
> From: Minh Hon Chau <minh.c...@dektech.com.au>
> Sent: Monday, July 9, 2018 8:52 AM
> To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; 
> hans.nordeb...@ericsson.com; gary....@dektech.com.au
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to 
> avoid message ID mismatch [#2891]
>
> This problem possibly happens in normal situation, not only "AMFND not yet 
> up", so seeing it as a race and can also fixes the problem in case of "AMFND 
> already up"
>
>
> On 09/07/18 11:41, Minh Hon Chau wrote:
>> Hi Thuan,
>>
>> I often see mds down coming before clm callback in my test, the 
>> problem in this ticket has those events coming reverse order.
>>
>> <143>1 2018-07-09T11:19:22.185279+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23028"] 372:amf/amfd/ndfsm.cc:772 >>
>> avd_mds_avnd_down_evh: 2030f, 0x55895024e410
>> <143>1 2018-07-09T11:19:22.185303+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23029"] 372:amf/amfd/ndproc.cc:1219 >>
>> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
>> <143>1 2018-07-09T11:19:22.185324+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23030"] 372:amf/amfd/ndfsm.cc:1111 >> 
>> avd_node_mark_absent ...
>> <143>1 2018-07-09T11:19:22.253699+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23890"] 372:amf/amfd/ndfsm.cc:821 << 
>> avd_mds_avnd_down_evh
>>
>> <143>1 2018-07-09T11:19:22.272188+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23917"] 372:clm/agent/clma_api.cc:786 >> 
>> saClmDispatch ...
>> <143>1 2018-07-09T11:19:22.272268+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23926"] 372:amf/amfd/clm.cc:223 >> clm_track_cb: '0'
>> '4' '1'
>> <143>1 2018-07-09T11:19:22.272277+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23927"] 372:amf/amfd/clm.cc:239 TR 
>> numberOfMembers:'4', numberOfItems:'1'
>> <143>1 2018-07-09T11:19:22.272285+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23928"] 372:amf/amfd/clm.cc:245 TR i = 0, 
>> node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3
>> <142>1 2018-07-09T11:19:22.272294+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23929"] 372:amf/amfd/clm.cc:320 IN clm_track_cb: 
>> CLM node 'safNode=PL-3,safCluster=myClmCluster' is not an AMF cluster 
>> member; MDS down received
>> <143>1 2018-07-09T11:19:22.272303+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23930"] 372:amf/amfd/clm.cc:443 << clm_track_cb ...
>> <143>1 2018-07-09T11:19:22.272352+10:00 SC-1 osafamfd 372 osafamfd 
>> [meta sequenceId="23936"] 372:clm/agent/clma_api.cc:834 << 
>> saClmDispatch
>>
>>
>> Reset counter can avoid the msg mismatch, but I'm not sure it's 
>> completely fixed since the mds down event is not handled properly.
>> If you have mds down event handled properly in both cases, will there 
>> be this msg mismatch problem?
>>
>> Thanks,
>> Minh
>>
>> On 09/07/18 11:11, Tran Thuan wrote:
>>> Hi Minh,
>>>
>>> If I understand correctly, CLM callback always before AMFND down/up 
>>> event.
>>> The problem is just because AMFND not yet up and CLM callback 
>>> receive MDS down of node and delete node.
>>> After that AMFND down event come cannot find the node, cannot find 
>>> the node is expected I guess.
>>> So, we only need to reset counter to make thing right, what do you 
>>> think?
>>>
>>> <143>1 2018-07-03T18:40:27.239707+10:00 SC-2-1 osafamfd 4815 
>>> osafamfd [meta sequenceId="2448"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb:
>>> '0' '4' '1'
>>> <142>1 2018-07-03T18:40:27.239778+10:00 SC-2-1 osafamfd 4815 
>>> osafamfd [meta sequenceId="2451"] 4815:amf/amfd/clm.cc:299 IN clm_track_cb:
>>> CLM node 'safNode=PL-2-3,safCluster=myClmCluster' is not an AMF 
>>> cluster member; MDS down received
>>> <143>1 2018-07-03T18:40:27.239792+10:00 SC-2-1 osafamfd 4815 
>>> osafamfd [meta sequenceId="2452"] 4815:amf/amfd/clm.cc:418 << 
>>> clm_track_cb
>>> <143>1 2018-07-03T18:40:27.243287+10:00 SC-2-1 osafamfd 4815 
>>> osafamfd [meta sequenceId="2465"] 4815:amf/amfd/ndfsm.cc:772 >>
>>> avd_mds_avnd_down_evh: 2030f, (nil)
>>> <143>1 2018-07-03T18:40:27.243307+10:00 SC-2-1 osafamfd 4815 
>>> osafamfd [meta sequenceId="2466"] 4815:amf/amfd/ndfsm.cc:821 << 
>>> avd_mds_avnd_down_evh
>>>
>>> <143>1 2018-07-03T18:41:37.965003+10:00 SC-2-1 osafamfd 4815 
>>> osafamfd [meta sequenceId="2596"] 4815:amf/amfd/clm.cc:223 >> clm_track_cb:
>>> '0' '4' '1'
>>> <143>1 2018-07-03T18:41:37.965104+10:00 SC-2-1 osafamfd 4815 
>>> osafamfd [meta sequenceId="2605"] 4815:amf/amfd/clm.cc:418 << 
>>> clm_track_cb
>>> <143>1 2018-07-03T18:41:45.016379+10:00 SC-2-1 osafamfd 4815 
>>> osafamfd [meta sequenceId="2631"] 4815:amf/amfd/ndfsm.cc:750 TR 
>>> amfnd on 2030f is up
>>> <143>1 2018-07-03T18:41:45.139262+10:00 SC-2-1 osafamfd 4815 
>>> osafamfd [meta sequenceId="2686"] 4815:amf/amfd/ndfsm.cc:772 >>
>>> avd_mds_avnd_down_evh: 2030f, 0x7ff9ac3dc940
>>> <143>1 2018-07-03T18:41:45.141883+10:00 SC-2-1 osafamfd 4815 
>>> osafamfd [meta sequenceId="3008"] 4815:amf/amfd/ndfsm.cc:821 << 
>>> avd_mds_avnd_down_evh
>>>
>>> Best Regards,
>>> Thuan
>>>
>>>
>>> -----Original Message-----
>>> From: Minh Hon Chau <minh.c...@dektech.com.au>
>>> Sent: Monday, July 9, 2018 7:15 AM
>>> To: thuan.tran <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; 
>>> hans.nordeb...@ericsson.com; gary....@dektech.com.au
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to 
>>> avoid message ID mismatch [#2891]
>>>
>>> Hi,
>>>
>>> In normal situation when I stop Opensaf or reboot the node, I often 
>>> get MDS down first, then CLM track callback. [1]
>>>
>>> The problem in this ticket is because AMFD has CLM track callback 
>>> coming before MDS down. [2]
>>>
>>> We need to see it as a race condition, not because of missing the 
>>> counters reset, so that the amfd's function to deal with MDS down 
>>> should be called. That would be equivalent to [1], which is 
>>> currently working.
>>>
>>> With this patch, we still don't have avd_mds_avnd_down_evh() to be 
>>> called, that function processes many things in the event of mds down.
>>> (ie. @nodes_exit_cnt, ...)
>>>
>>> I think the node_info.member should be set TRUE or FALSE in the 
>>> event of CLM callback since it means CLM membership. We can use both 
>>> @member and one of variables indicating MDS down, to trap each other 
>>> in the events of MDS down and CLM callback, so that in the end we 
>>> have
>>> avd_mds_avnd_down_evh() and nodeid is removed out of node_id_db in 
>>> both [1] and [2].
>>>
>>> Thanks,
>>> Minh
>>> On 06/07/18 20:36, thuan.tran wrote:
>>>> There is a case that after AMFD send reboot order due to “out of 
>>>> sync window”.
>>>> AMFD receive CLM track callback but node is not member yet and 
>>>> delete node.
>>>> Later AMFND MDS down will not reset msg_id counter since it cannot 
>>>> find node.
>>>> When node reboot up, AMFD continue use current msg_id counter send 
>>>> to AMFND cause messasge ID mismatch in AMFND then AMFND order 
>>>> reboot itself node.
>>>> ---
>>>>     src/amf/amfd/clm.cc | 7 ++++++-
>>>>     1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc index
>>>> e113a65f9..4a15d5ad7 100644
>>>> --- a/src/amf/amfd/clm.cc
>>>> +++ b/src/amf/amfd/clm.cc
>>>> @@ -316,9 +316,14 @@ static void clm_track_cb(
>>>>                        __FUNCTION__, node_name.c_str());
>>>>                 goto done;
>>>>               } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
>>>> -            LOG_IN("%s: CLM node '%s' is not an AMF cluster member;
>>>> MDS down received",
>>>> +            LOG_IN("%s: CLM node '%s' is ABSENT; MDS down 
>>>> +received",
>>>>                        __FUNCTION__, node_name.c_str());
>>>>                 avd_node_delete_nodeid(node);
>>>> +            /* Reset msg_id because AVND MDS down may come later
>>>> +              and cannot find node to reset these, cause message 
>>>> + ID
>>>> mismatch. */
>>>> +            node->rcv_msg_id = 0;
>>>> +            node->snd_msg_id = 0;
>>>> +            node->node_info.member = SA_FALSE;
>>>>                 goto done;
>>>>               }
>>>>               TRACE(" Node Left: rootCauseEntity %s for node %u",
>>>
>
>



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Re: [devel] [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]

Reply via email to