Hi Minh,
What do you think about this?
static void clm_track_cb(
...
  } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
    LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
           __FUNCTION__, node_name.c_str());
    // avd_node_delete_nodeid(node); [Thuan]: don't delete the node,
    // let avnd_down() do the proper cleanup
    goto done;
  }
...
...
Best Regards,
Thuan
From: Minh Hon Chau <minh.c...@dektech.com.au>
Sent: Tuesday, July 10, 2018 3:39 PM
To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]
Hi Thuan,
Please see comments inline.
Thanks
Minh
On 10/07/18 18:16, Tran Thuan wrote:
Hi Nagu,
I have just updated the method to reproduce the issue reported in #2891.
It happens after a TIPC disconnect/connect on PL-3.
Minh raised the issue where the SC comes back after headless and then a PL
reboots after its info has been synced to the SC.
@Minh: about your amfd trace, I would like to confirm some points:
  * Why does the CLM callback receive MDS down at 14:54:28, but amfnd up is
    received later, and amfnd down only at 14:58:28? Also, I don't see the
    trace "avd_mds_avnd_down_evh: 2040f, (nil)" as in the analysis of the
    ticket's issue.
  * Did you reproduce this with the official OpenSAF code (no hacked code),
    just by rebooting PL-4 after the SC comes back from headless?
[Minh]: I did an early return in avd_mds_avnd_down_evh() because I wanted to
see how amfd reacts if the CLM callback comes first (I have never been able
to reproduce this problem). But does it matter? The point is that, if a PL
is rebooted, both the CLM callback and MDS down can come before the PL joins
the AMF cluster, and those events can also come after amfd processes the
synced assignments, because of the sync window.
  * Do you have any suggestion for a solution?
Current code:

static void clm_track_cb(
...
  } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
    LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
           __FUNCTION__, node_name.c_str());
    avd_node_delete_nodeid(node);
    goto done;
  }
...

The *if* is not strong enough to prove that MDS down has actually been
received. I think you can use @amfnd_svc_db (or something similar) to
indicate that MDS down really did come. And you need to delete the nodeId
from node_id_db too.
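Roughly, as a small standalone sketch (the types, the maps and the
mds_down_received flag are simplified placeholders here, not the real amfd
structures or the actual @amfnd_svc_db lookup):

// Simplified stand-ins for AVD_AVND, node_name_db and node_id_db.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct Node {
  uint32_t node_id = 0;
  bool is_absent = true;           // stands in for node_state == AVD_AVND_STATE_ABSENT
  bool mds_down_received = false;  // would be derived from @amfnd_svc_db (or similar)
};

std::map<std::string, Node> node_name_db;
std::map<uint32_t, Node*> node_id_db;

// Sketch of the CLM "node left" branch: delete the node only when the amfnd
// MDS down has really been seen, and clean *both* databases.
void on_clm_node_left(const std::string& node_name) {
  auto it = node_name_db.find(node_name);
  if (it == node_name_db.end()) return;
  if (it->second.is_absent && it->second.mds_down_received) {
    node_id_db.erase(it->second.node_id);  // also drop the nodeId from node_id_db
    node_name_db.erase(it);                // equivalent of avd_node_delete_nodeid()
  } else {
    std::cout << "MDS down not confirmed for " << node_name
              << ", leave the node for the MDS down handler\n";
  }
}

int main() {
  Node n;
  n.node_id = 0x2040f;
  node_name_db["safNode=PL-4"] = n;
  node_id_db[0x2040f] = &node_name_db["safNode=PL-4"];
  on_clm_node_left("safNode=PL-4");  // not deleted: MDS down not yet seen
}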
<143>1 2018-07-10T14:54:27.369871+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="11063"] 297:amf/amfd/siass.cc:1127 >>
avd_susi_recreate: msg_id: 0 node_id: 132111 num_sisu: 4
<143>1 2018-07-10T14:54:27.377689+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="11320"] 297:amf/amfd/siass.cc:1195 <<
avd_susi_recreate
<142>1 2018-07-10T14:54:28.270464+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="12725"] 297:amf/amfd/clm.cc:320 IN clm_track_cb:
CLM node 'safNode=PL-4,safCluster=myClmCluster' is not an AMF
cluster member; MDS down received
<143>1 2018-07-10T14:54:31.16261+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="17815"] 297:amf/amfd/ndfsm.cc:752 TR amfnd on
2040f is up
<143>1 2018-07-10T14:54:31.422545+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="17886"] 297:amf/amfd/ndfsm.cc:296 >>
avd_node_up_evh: from 2040f,
safAmfNode=PL-4,safAmfCluster=myAmfCluster
<143>1 2018-07-10T14:54:31.422868+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="17899"] 297:amf/amfd/node.cc:461 >>
avd_node_state_set: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
ABSENT => NO_CONFIG
<143>1 2018-07-10T14:54:31.42297+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="17904"] 297:amf/amfd/node.cc:466 <<
avd_node_state_set
<143>1 2018-07-10T14:54:31.423114+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="17910"] 297:amf/amfd/ndfsm.cc:536 <<
avd_node_up_evh
<143>1 2018-07-10T14:54:31.576633+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="18023"] 297:amf/amfd/node.cc:461 >>
avd_node_state_set: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
NO_CONFIG => NCS_INIT
<143>1 2018-07-10T14:54:31.576689+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="18028"] 297:amf/amfd/node.cc:466 <<
avd_node_state_set
<143>1 2018-07-10T14:58:28.032461+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21836"] 300:amf/amfd/mds.cc:463 TR avnd
2040fe497957d down
<143>1 2018-07-10T14:58:28.032592+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21837"] 297:amf/amfd/main.cc:766 >>
process_event: evt->rcv_evt 21
<143>1 2018-07-10T14:58:28.032613+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21838"] 297:amf/amfd/ndfsm.cc:774 >>
avd_mds_avnd_down_evh: 2040f, 0x55b62f9dcff0
<143>1 2018-07-10T14:58:28.032641+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21839"] 297:amf/amfd/ndproc.cc:1219 >>
avd_node_failover: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
<143>1 2018-07-10T14:58:28.032656+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21840"] 297:amf/amfd/ndfsm.cc:1113 >>
avd_node_mark_absent
<143>1 2018-07-10T14:58:28.032679+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21841"] 297:amf/amfd/imm.cc:1887 >>
avd_saImmOiRtObjectUpdate:
'safAmfNode=PL-4,safAmfCluster=myAmfCluster' saAmfNodeOperState
<143>1 2018-07-10T14:58:28.032722+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21842"] 297:amf/amfd/imm.cc:1917 <<
avd_saImmOiRtObjectUpdate
<143>1 2018-07-10T14:58:28.032738+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21843"] 297:amf/amfd/node.cc:461 >>
avd_node_state_set: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
NCS_INIT => ABSENT
<143>1 2018-07-10T14:58:28.032822+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21848"] 297:amf/amfd/node.cc:466 <<
avd_node_state_set
<143>1 2018-07-10T14:58:28.033059+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21858"] 297:amf/amfd/ndfsm.cc:1134 <<
avd_node_mark_absent
<143>1 2018-07-10T14:58:28.033102+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="21861"] 297:amf/amfd/sgproc.cc:2177 >>
avd_node_down_mw_susi_failover:
'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
<143>1 2018-07-10T14:58:28.060869+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="22222"] 297:amf/amfd/sgproc.cc:2223 <<
avd_node_down_mw_susi_failover
<143>1 2018-07-10T14:58:28.06088+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="22223"] 297:amf/amfd/sgproc.cc:2238 >>
avd_node_down_appl_susi_failover:
'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
<143>1 2018-07-10T14:58:28.076943+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="22554"] 297:amf/amfd/sgproc.cc:2311 <<
avd_node_down_appl_susi_failover
<143>1 2018-07-10T14:58:28.077075+10:00 SC-1 osafamfd 297 osafamfd
[meta sequenceId="22562"] 297:amf/amfd/ndproc.cc:1234 <<
avd_node_failover
Best Regards,
Thuan
From: nagen...@hasolutions.in
Sent: Tuesday, July 10, 2018 1:36 PM
To: Minh Hon Chau <minh.c...@dektech.com.au>; Tran Thuan <thuan.t...@dektech.com.au>; hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: RE: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]
Hi Thuan,
I agree with Minh. We need to know why the node-left event came first and
MDS down came later, and about the other flows. It would be great if you
could reproduce the scenario and upload the traces of the payload and the
controller (please keep the time in sync) to the ticket.
Thanks,
Nagendra, 91-9866424860
www.hasolutions.in
https://www.linkedin.com/company/hasolutions/
High Availability Solutions Pvt. Ltd.
- OpenSAF support and services
--------- Original Message ---------
Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]
From: "Minh Hon Chau" <minh.c...@dektech.com.au>
Date: 7/10/18 10:49 am
To: "Tran Thuan" <thuan.t...@dektech.com.au>, nagen...@hasolutions.in, hans.nordeb...@ericsson.com, gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Hi again,
I am not sure how you replicated this scenario. It looks to me as if you
deliberately rebooted PL-3 before amfd processed the synced assignments.
The node_up sequence has many branches; you are replicating one of them.
What I mean is: at the time amfd calls avd_process_state_info_queue(), the
PL has not joined the AMF cluster yet and keeps resending the node_up
message every second.
When amfd finishes creating the synced assignments, the node sync window is
still open and cb->init_state has not moved to APP_STATE, so the node_up of
the PL is still acceptable. At this point a PL can be rebooted by someone.
For your reference, I have attached the full amfd trace to the ticket. I had
to turn off PL-3 to trigger the node sync window (up to 10 seconds), then I
rebooted PL-4. Please note that this node sync timeout scenario did happen
on the application side.
Resetting the counters in the CLM callback is the easy way, but it does not
fully fix the issue and does not make us safe from it.
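For context, a tiny self-contained illustration of the counter part of the
problem (an assumed, simplified model of the per-node sequence check, not
the real amfd/amfnd code): if the director keeps its old snd_msg_id after
the node has restarted counting from zero, the very first message after the
restart is rejected as a mismatch.

#include <cassert>
#include <cstdint>

// Per-node counters, mirroring snd_msg_id / rcv_msg_id in AVD_AVND.
struct Director { uint32_t snd_msg_id = 0; };      // active amfd side
struct NodeDirector { uint32_t rcv_msg_id = 0; };  // amfnd side

// Assumed check: amfnd accepts only the next message id in sequence.
bool accept(NodeDirector& nd, uint32_t msg_id) {
  if (msg_id != nd.rcv_msg_id + 1) return false;   // "message ID mismatch"
  nd.rcv_msg_id = msg_id;
  return true;
}

int main() {
  Director amfd;
  NodeDirector amfnd;

  assert(accept(amfnd, ++amfd.snd_msg_id));   // msg_id 1, normal traffic
  assert(accept(amfnd, ++amfd.snd_msg_id));   // msg_id 2

  // The node reboots: amfnd starts from 0 again, but amfd kept snd_msg_id == 2
  // because the node had been deleted and the MDS down handling never reset it.
  amfnd = NodeDirector{};
  assert(!accept(amfnd, ++amfd.snd_msg_id));  // 3 != 1: mismatch, amfnd reboots the node

  // With the counter reset (as the patch does), the sequence restarts cleanly.
  amfd.snd_msg_id = 0;
  assert(accept(amfnd, ++amfd.snd_msg_id));   // msg_id 1 accepted again
}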
Thanks
Minh
On 10/07/18 14:00, Tran Thuan wrote:
> Hi Minh,
>
> I checked your concern about the case where the PL goes down before sending
> node_up but after the assignments are synced (recovery after headless).
> The current code still works correctly, because the active AMFD does not
> recreate the assignments.
> So my point of view stands: just reset the 3 fields as in the current patch.
>
> Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Synced
SISU:safSi=NoRed2,safApp=OpenSAF,safSu=PL-3,safSg=NoRed,safApp=OpenSAF
<1, 3>
> Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Synced
SU:safSu=PL-3,safSg=NoRed,safApp=OpenSAF <0, 1, 3>
> Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO 8 CSICOMP states sent
> Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO 8 COMP states sent
> Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO Sending node up
due to NCSMDS_NEW_ACTIVE
> Jul 10 13:40:05 PL-2-3 osafamfnd[5971]: NO
====================delay to power off this node
====================
> Jul 10 13:40:05 PL-2-3 osafckptnd[5984]: NO CLM selection
object was updated. (17)
> Jul 10 13:40:06 PL-2-3 osafamfnd[5971]: NO
====================delay to power off this node
====================
> Jul 10 13:40:07 PL-2-3 osafamfnd[5971]: NO
====================delay to power off this node
====================
> Jul 10 13:40:08 PL-2-3 osafamfnd[5971]: NO
====================delay to power off this node
====================
>
> <143>1 2018-07-10T13:40:15.365252+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13233"] 4827:amf/amfd/ndfsm.cc:296
>> avd_node_up_evh: from 2010f,
safAmfNode=SC-1,safAmfCluster=myAmfCluster
> <143>1 2018-07-10T13:40:15.365259+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13234"] 4827:amf/amfd/node.cc:231
>> avd_node_get: safAmfNode=SC-1,safAmfCluster=myAmfCluster
> <143>1 2018-07-10T13:40:15.365263+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13235"] 4827:amf/amfd/node.cc:232
<< avd_node_get
> <143>1 2018-07-10T13:40:15.365314+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13236"] 4827:amf/amfd/ndfsm.cc:216
>> avd_count_node_up
> <143>1 2018-07-10T13:40:15.36532+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13237"] 4827:amf/amfd/ndfsm.cc:226
TR Number of node director(s) that director received node_up msg:0
> <143>1 2018-07-10T13:40:15.365325+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13238"] 4827:amf/amfd/ndfsm.cc:228
<< avd_count_node_up
> <143>1 2018-07-10T13:40:15.365329+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13239"] 4827:amf/amfd/ndfsm.cc:353
TR NodeSync timer is active, ignore this node_up msg
(nodeid:2010f)
> <143>1 2018-07-10T13:40:15.365334+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13240"] 4827:amf/amfd/ndfsm.cc:534
<< avd_node_up_evh
>
> <143>1 2018-07-10T13:40:16.462939+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13280"] 4827:amf/amfd/ndfsm.cc:296
>> avd_node_up_evh: from 2010f,
safAmfNode=SC-1,safAmfCluster=myAmfCluster
> <141>1 2018-07-10T13:40:16.462975+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13281"] 4827:amf/amfd/ndfsm.cc:254
NO Received node_up from 2010f: msg_id 1
> <143>1 2018-07-10T13:40:16.462981+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13282"] 4827:amf/amfd/ndfsm.cc:63
>> avd_process_state_info_queue
> <143>1 2018-07-10T13:40:16.462986+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13283"] 4827:amf/amfd/ndfsm.cc:65
TR queue_size before processing: 2
> <143>1 2018-07-10T13:40:16.463162+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13284"] 4827:amf/amfd/ndfsm.cc:73
TR rcv_evt: 12
> <143>1 2018-07-10T13:40:16.463175+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13285"] 4827:amf/amfd/ndfsm.cc:79
TR msg_type: 31
> <143>1 2018-07-10T13:40:16.463203+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13286"] 4827:amf/amfd/siass.cc:1132
>> avd_susi_recreate: msg_id: 0 node_id: 131855 num_sisu: 3
> <139>1 2018-07-10T13:40:16.463224+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13287"] 4827:amf/amfd/siass.cc:1140
ER Node 2030f has left the cluster
> <143>1 2018-07-10T13:40:16.463258+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13288"] 4827:amf/amfd/siass.cc:0 <<
avd_susi_recreate
> <143>1 2018-07-10T13:40:16.463273+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13289"]
4827:amf/common/d2nmsg.c:931 >> avsv_free_n2d_nd_sisu_state_info
> <143>1 2018-07-10T13:40:16.463278+10:00 SC-2-1 osafamfd 4827
osafamfd [meta sequenceId="13290"]
4827:amf/common/d2nmsg.c:953 TR 3 SISU records to free
>
> Best Regards,
> Thuan
>
> -----Original Message-----
> From: Minh Hon Chau <minh.c...@dektech.com.au>
> Sent: Monday, July 9, 2018 4:04 PM
> To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; hans.nordeb...@ericsson.com; gary....@dektech.com.au
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]
>
> Hi,
>
> How about @nodes_exit_cnt? The active amfd checkpoints @nodes_exit_cnt to
> the standby amfd inside the MDS down event. You can say it is not used, but
> that is what happens with this counter.
>
> By "AMFND not yet up", do you mean node_state AVD_AVND_STATE_ABSENT?
> You reset the counters if the node state is AVD_AVND_STATE_ABSENT in the
> CLM callback.
> If a payload is coming back from headless, it syncs the assignments to the
> active amfd first and then enters the node_up sequence. This node_up
> sequence can take up to 10 seconds before the payload becomes
> AVD_AVND_STATE_PRESENT.
> If this issue happens during those 10 seconds, the node_state is still
> AVD_AVND_STATE_ABSENT and the active amfd does not process the MDS down
> event properly, so how could amfd remove the synced assignments?
>
> As mentioned, the problem is not that the active amfd is missing one, two
> or three counters to reset; it is that it fails to process the MDS down
> event properly.
>
> Are you able to reproduce this problem?
>
> Thanks
> Minh
>
> On 09/07/18 16:33, Tran Thuan wrote:
>> Hi Minh,
>>
>> Since AMFND was not up before, the other fields are not updated yet.
>> I think only 3 fields are changed in this scenario; that's why my patch
>> resets only them.
>> Do you think so?
>>
>> Best Regards,
>> Thuan
>>
>> -----Original Message-----
>> From: Minh Hon Chau <minh.c...@dektech.com.au>
>> Sent: Monday, July 9, 2018 1:22 PM
>> To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; hans.nordeb...@ericsson.com; gary....@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]
>>
>> Hi,
>>
>> Ok, so we can ignore the "AMFND already up" case.
>>
>> "- AMFND not yet up before (node reboot up/first up): this
is our issue.
>> + current only delete node without care msg_id counter."
>>
>> Are you able to reproduce this problem?
>> You only reset the msg_id counter of a node, how about the
others:
>> adest, recvr_fail_sw, node_info.initialViewNumber,
nodes_exit_cnt,..
>> they all are checked in the other places.
>> Also, when the node goes down, the standby resets all of
above variables directly, while the active doesn't as of this
issue or only resets msg_id counters within your patch. The
active and standby may not be consistent.
>>
>> Another thing you may consider, if any new/old
attributes/variables of AVD_AVND is going to be added/removed,
we only look at one place - event of MDS down, don't have to
look at/reset those ones in the CLM callback.
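>> To make that maintenance point concrete, a rough standalone sketch
>> (placeholder types and names only, not the actual AVD_AVND definition or
>> the real handler) of keeping all the resets in one routine on the MDS
>> down path:
>>
>> // Sketch only: the field names follow the ones discussed above.
>> #include <cstdint>
>>
>> struct AvdAvnd {
>>   uint32_t snd_msg_id = 0;
>>   uint32_t rcv_msg_id = 0;
>>   uint64_t adest = 0;
>>   bool recvr_fail_sw = false;
>>   uint64_t initial_view_number = 0;  // stands in for node_info.initialViewNumber
>>   bool member = false;               // stands in for node_info.member
>> };
>>
>> // One place that resets everything when amfnd disappears. Both orderings
>> // (CLM callback first or MDS down first) would end up here, and a new
>> // attribute only ever needs to be added in this one spot.
>> void reset_node_on_amfnd_down(AvdAvnd* node, uint32_t* nodes_exit_cnt) {
>>   node->snd_msg_id = 0;
>>   node->rcv_msg_id = 0;
>>   node->adest = 0;
>>   node->recvr_fail_sw = false;
>>   node->initial_view_number = 0;
>>   node->member = false;
>>   ++*nodes_exit_cnt;  // also checkpointed to the standby amfd
>> }
>>
>> int main() {
>>   AvdAvnd node;
>>   uint32_t nodes_exit_cnt = 0;
>>   reset_node_on_amfnd_down(&node, &nodes_exit_cnt);
>> }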
>>
>> Thanks
>> Minh
>>
>> On 09/07/18 12:12, Tran Thuan wrote:
>>> Hi Minh,
>>>
>>> OK, so the CLM callback can be handled before or after AMFND down.
>>> In the case of AMFND down followed by the CLM callback, the current flow
>>> works correctly.
>>> In the case of the CLM callback followed by AMFND down:
>>> - AMFND already up before: the current flow will do
>>> clm_node_exit_complete(), which still works correctly.
>>> - AMFND not yet up before (node reboots up / first up): this is our issue.
>>> + currently we only delete the node without caring about the msg_id
>>> counter.
>>>
>>> Best Regards,
>>> Thuan
>>>
>>> -----Original Message-----
>>> From: Minh Hon Chau <minh.c...@dektech.com.au>
>>> Sent: Monday, July 9, 2018 8:52 AM
>>> To: Tran Thuan <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; hans.nordeb...@ericsson.com; gary....@dektech.com.au
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]
>>>
>>> This problem can possibly happen in a normal situation, not only when
>>> AMFND is "not yet up", so seeing it as a race also fixes the problem in
>>> the "AMFND already up" case.
>>>
>>>
>>> On 09/07/18 11:41, Minh Hon Chau wrote:
>>>> Hi Thuan,
>>>>
>>>> I often see MDS down coming before the CLM callback in my tests; the
>>>> problem in this ticket has those events coming in the reverse order.
>>>>
>>>> <143>1 2018-07-09T11:19:22.185279+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23028"] 372:amf/amfd/ndfsm.cc:772 >>
>>>> avd_mds_avnd_down_evh: 2030f, 0x55895024e410
>>>> <143>1 2018-07-09T11:19:22.185303+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23029"] 372:amf/amfd/ndproc.cc:1219 >>
>>>> avd_node_failover:
'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
>>>> <143>1 2018-07-09T11:19:22.185324+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23030"] 372:amf/amfd/ndfsm.cc:1111 >>
>>>> avd_node_mark_absent ...
>>>> <143>1 2018-07-09T11:19:22.253699+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23890"] 372:amf/amfd/ndfsm.cc:821 <<
>>>> avd_mds_avnd_down_evh
>>>>
>>>> <143>1 2018-07-09T11:19:22.272188+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23917"] 372:clm/agent/clma_api.cc:786 >>
>>>> saClmDispatch ...
>>>> <143>1 2018-07-09T11:19:22.272268+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23926"] 372:amf/amfd/clm.cc:223 >>
clm_track_cb: '0'
>>>> '4' '1'
>>>> <143>1 2018-07-09T11:19:22.272277+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23927"] 372:amf/amfd/clm.cc:239 TR
>>>> numberOfMembers:'4', numberOfItems:'1'
>>>> <143>1 2018-07-09T11:19:22.272285+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23928"] 372:amf/amfd/clm.cc:245 TR i = 0,
>>>> node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3
>>>> <142>1 2018-07-09T11:19:22.272294+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23929"] 372:amf/amfd/clm.cc:320 IN
clm_track_cb:
>>>> CLM node 'safNode=PL-3,safCluster=myClmCluster' is not an AMF
>>>> cluster member; MDS down received
>>>> <143>1 2018-07-09T11:19:22.272303+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23930"] 372:amf/amfd/clm.cc:443 <<
clm_track_cb ...
>>>> <143>1 2018-07-09T11:19:22.272352+10:00 SC-1 osafamfd 372
osafamfd
>>>> [meta sequenceId="23936"] 372:clm/agent/clma_api.cc:834 <<
>>>> saClmDispatch
>>>>
>>>>
>>>> Resetting the counters can avoid the message mismatch, but I'm not sure
>>>> the problem is completely fixed, since the MDS down event is not handled
>>>> properly.
>>>> If the MDS down event is handled properly in both cases, will this
>>>> message mismatch problem still occur?
>>>>
>>>> Thanks,
>>>> Minh
>>>>
>>>> On 09/07/18 11:11, Tran Thuan wrote:
>>>>> Hi Minh,
>>>>>
>>>>> If I understand correctly, the CLM callback always comes before the
>>>>> AMFND down/up event.
>>>>> The problem is just that AMFND is not yet up when the CLM callback
>>>>> receives the MDS down of the node and deletes the node.
>>>>> After that, the AMFND down event comes and cannot find the node; not
>>>>> finding the node is expected, I guess.
>>>>> So we only need to reset the counters to make things right; what do
>>>>> you think?
>>>>>
>>>>> <143>1 2018-07-03T18:40:27.239707+10:00 SC-2-1 osafamfd 4815
>>>>> osafamfd [meta sequenceId="2448"]
4815:amf/amfd/clm.cc:223 >> clm_track_cb:
>>>>> '0' '4' '1'
>>>>> <142>1 2018-07-03T18:40:27.239778+10:00 SC-2-1 osafamfd 4815
>>>>> osafamfd [meta sequenceId="2451"]
4815:amf/amfd/clm.cc:299 IN clm_track_cb:
>>>>> CLM node 'safNode=PL-2-3,safCluster=myClmCluster' is not
an AMF
>>>>> cluster member; MDS down received
>>>>> <143>1 2018-07-03T18:40:27.239792+10:00 SC-2-1 osafamfd 4815
>>>>> osafamfd [meta sequenceId="2452"]
4815:amf/amfd/clm.cc:418 <<
>>>>> clm_track_cb
>>>>> <143>1 2018-07-03T18:40:27.243287+10:00 SC-2-1 osafamfd 4815
>>>>> osafamfd [meta sequenceId="2465"]
4815:amf/amfd/ndfsm.cc:772 >>
>>>>> avd_mds_avnd_down_evh: 2030f, (nil)
>>>>> <143>1 2018-07-03T18:40:27.243307+10:00 SC-2-1 osafamfd 4815
>>>>> osafamfd [meta sequenceId="2466"]
4815:amf/amfd/ndfsm.cc:821 <<
>>>>> avd_mds_avnd_down_evh
>>>>>
>>>>> <143>1 2018-07-03T18:41:37.965003+10:00 SC-2-1 osafamfd 4815
>>>>> osafamfd [meta sequenceId="2596"]
4815:amf/amfd/clm.cc:223 >> clm_track_cb:
>>>>> '0' '4' '1'
>>>>> <143>1 2018-07-03T18:41:37.965104+10:00 SC-2-1 osafamfd 4815
>>>>> osafamfd [meta sequenceId="2605"]
4815:amf/amfd/clm.cc:418 <<
>>>>> clm_track_cb
>>>>> <143>1 2018-07-03T18:41:45.016379+10:00 SC-2-1 osafamfd 4815
>>>>> osafamfd [meta sequenceId="2631"]
4815:amf/amfd/ndfsm.cc:750 TR
>>>>> amfnd on 2030f is up
>>>>> <143>1 2018-07-03T18:41:45.139262+10:00 SC-2-1 osafamfd 4815
>>>>> osafamfd [meta sequenceId="2686"]
4815:amf/amfd/ndfsm.cc:772 >>
>>>>> avd_mds_avnd_down_evh: 2030f, 0x7ff9ac3dc940
>>>>> <143>1 2018-07-03T18:41:45.141883+10:00 SC-2-1 osafamfd 4815
>>>>> osafamfd [meta sequenceId="3008"]
4815:amf/amfd/ndfsm.cc:821 <<
>>>>> avd_mds_avnd_down_evh
>>>>>
>>>>> Best Regards,
>>>>> Thuan
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Minh Hon Chau <minh.c...@dektech.com.au>
>>>>> Sent: Monday, July 9, 2018 7:15 AM
>>>>> To: thuan.tran <thuan.t...@dektech.com.au>; nagen...@hasolutions.in; hans.nordeb...@ericsson.com; gary....@dektech.com.au
>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>> Subject: Re: [PATCH 1/1] amf: amfd should reset msg_id counter to avoid message ID mismatch [#2891]
>>>>>
>>>>> Hi,
>>>>>
>>>>> In a normal situation, when I stop OpenSAF or reboot the node, I often
>>>>> get MDS down first, then the CLM track callback. [1]
>>>>>
>>>>> The problem in this ticket is that AMFD gets the CLM track callback
>>>>> before MDS down. [2]
>>>>>
>>>>> We need to see it as a race condition, not as missing counter resets,
>>>>> so that amfd's function to deal with MDS down still gets called. That
>>>>> would be equivalent to [1], which currently works.
>>>>>
>>>>> With this patch, avd_mds_avnd_down_evh() is still not called; that
>>>>> function processes many things in the MDS down event
>>>>> (i.e. @nodes_exit_cnt, ...).
>>>>>
>>>>> I think node_info.member should be set TRUE or FALSE in the CLM
>>>>> callback event since it represents CLM membership. We can use both
>>>>> @member and one of the variables indicating MDS down to trap each other
>>>>> in the MDS down and CLM callback events, so that in the end
>>>>> avd_mds_avnd_down_evh() is invoked and the nodeid is removed from
>>>>> node_id_db in both [1] and [2].
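>>>>> A rough standalone sketch of that trapping idea (simplified placeholder
>>>>> types, not the actual amfd handlers): whichever of the two events
>>>>> arrives second performs the full down handling, so it runs exactly once
>>>>> in both orderings.
>>>>>
>>>>> #include <iostream>
>>>>>
>>>>> struct Node {
>>>>>   bool member = true;     // CLM membership, like node_info.member
>>>>>   bool mds_down = false;  // "one of the variables indicating MDS down"
>>>>> };
>>>>>
>>>>> void finish_node_down(Node&) {
>>>>>   // Here the real code would do the avd_mds_avnd_down_evh() work and
>>>>>   // remove the nodeid from node_id_db.
>>>>>   std::cout << "run full node-down handling\n";
>>>>> }
>>>>>
>>>>> void on_clm_node_left(Node& n) {  // order [2]: CLM callback comes first
>>>>>   n.member = false;
>>>>>   if (n.mds_down) finish_node_down(n);  // MDS down was already seen
>>>>> }
>>>>>
>>>>> void on_mds_down(Node& n) {       // order [1]: MDS down comes first
>>>>>   n.mds_down = true;
>>>>>   if (!n.member) finish_node_down(n);   // CLM already said the node left
>>>>> }
>>>>>
>>>>> int main() {
>>>>>   Node a;
>>>>>   on_mds_down(a);       // [1]: waits for the CLM callback...
>>>>>   on_clm_node_left(a);  // ...now the full handling runs, exactly once
>>>>>   Node b;
>>>>>   on_clm_node_left(b);  // [2]: waits for MDS down...
>>>>>   on_mds_down(b);       // ...now the full handling runs, exactly once
>>>>> }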
>>>>>
>>>>> Thanks,
>>>>> Minh
>>>>> On 06/07/18 20:36, thuan.tran wrote:
>>>>>> There is a case where, after AMFD sends a reboot order due to "out of
>>>>>> sync window", AMFD receives the CLM track callback while the node is
>>>>>> not a member yet, and deletes the node.
>>>>>> The later AMFND MDS down will then not reset the msg_id counters,
>>>>>> since it cannot find the node.
>>>>>> When the node reboots and comes up, AMFD continues using the current
>>>>>> msg_id counter when sending to AMFND, causing a message ID mismatch in
>>>>>> AMFND, and AMFND then orders a reboot of its own node.
>>>>>> ---
>>>>>>  src/amf/amfd/clm.cc | 7 ++++++-
>>>>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
>>>>>> index e113a65f9..4a15d5ad7 100644
>>>>>> --- a/src/amf/amfd/clm.cc
>>>>>> +++ b/src/amf/amfd/clm.cc
>>>>>> @@ -316,9 +316,14 @@ static void clm_track_cb(
>>>>>>                 __FUNCTION__, node_name.c_str());
>>>>>>          goto done;
>>>>>>        } else if (node->node_state == AVD_AVND_STATE_ABSENT) {
>>>>>> -        LOG_IN("%s: CLM node '%s' is not an AMF cluster member; MDS down received",
>>>>>> +        LOG_IN("%s: CLM node '%s' is ABSENT; MDS down received",
>>>>>>                 __FUNCTION__, node_name.c_str());
>>>>>>          avd_node_delete_nodeid(node);
>>>>>> +        /* Reset msg_id because AVND MDS down may come later
>>>>>> +           and cannot find node to reset these, cause message ID mismatch. */
>>>>>> +        node->rcv_msg_id = 0;
>>>>>> +        node->snd_msg_id = 0;
>>>>>> +        node->node_info.member = SA_FALSE;
>>>>>>          goto done;
>>>>>>        }
>>>>>>      TRACE(" Node Left: rootCauseEntity %s for node %u",
>>
>
>