Hi Thang,
I can see it's a race in main thread that how amfd processes the mds
down and clm callback.
Node is going down
<143>1 2019-06-11T15:16:42.157517+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="38507"] 275:amf/amfd/mds.cc:398 >> avd_mds_svc_evt
<143>1 2019-06-11T15:16:42.157526+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="38508"] 275:amf/amfd/mds.cc:459 TR avnd 2030f000000bd down
<143>1 2019-06-11T15:16:42.157535+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="38509"] 275:amf/amfd/mds.cc:0 << avd_mds_svc_evt
<143>1 2019-06-11T15:16:48.332481+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="38623"] 272:amf/amfd/clm.cc:226 >> clm_track_cb: '0' '4' '1'
<143>1 2019-06-11T15:16:48.33249+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="38624"] 272:amf/amfd/clm.cc:242 TR numberOfMembers:'4',
numberOfItems:'1'
<143>1 2019-06-11T15:16:48.3325+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="38625"] 272:amf/amfd/clm.cc:248 TR i = 0,
node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:3
<143>1 2019-06-11T15:16:48.33251+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="38626"] 272:amf/amfd/clm.cc:332 TR Node Left:
rootCauseEntity safNode=PL-3,safCluster=myClmCluster for node 131855
<143>1 2019-06-11T15:16:48.332519+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="38627"] 272:amf/amfd/clm.cc:188 >> clm_node_exit_complete: 2030f
<143>1 2019-06-11T15:16:48.332534+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="38628"] 272:amf/amfd/ndproc.cc:1267 >> avd_node_failover:
'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
<143>1 2019-06-11T15:16:48.332545+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="38629"] 272:amf/amfd/ndfsm.cc:1153 >> avd_node_mark_absent
Node is coming up again
<143>1 2019-06-11T15:16:48.34867+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="39826"] 272:amf/amfd/clm.cc:226 >> clm_track_cb: '0' '4' '1'
<143>1 2019-06-11T15:16:48.348674+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="39827"] 272:amf/amfd/clm.cc:242 TR numberOfMembers:'5',
numberOfItems:'1'
<143>1 2019-06-11T15:16:48.348678+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="39828"] 272:amf/amfd/clm.cc:248 TR i = 0,
node:'safNode=PL-3,safCluster=myClmCluster', clusterChange:2
<143>1 2019-06-11T15:16:48.348685+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="39829"] 272:amf/amfd/node.cc:53 TR added node 131855
<143>1 2019-06-11T15:16:48.348689+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="39830"] 272:amf/amfd/clm.cc:417 TR Node Joined
'safNode=PL-3,safCluster=myClmCluster' '36'
Now amfd processes the mds down in main thread, its a race here then the
@node_info.member set to FALSE
<143>1 2019-06-11T15:16:48.351948+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="39971"] 272:amf/amfd/ndfsm.cc:779 >> avd_mds_avnd_down_evh:
2030f, 0x558e549a1650
<143>1 2019-06-11T15:16:48.351954+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="39972"] 272:amf/amfd/ndproc.cc:1267 >> avd_node_failover:
'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
<143>1 2019-06-11T15:16:48.351959+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="39973"] 272:amf/amfd/ndfsm.cc:1153 >> avd_node_mark_absent
Now the mds up comes, node_up come, but the node is not a clm member
<143>1 2019-06-11T15:16:48.701771+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40552"] 275:amf/amfd/mds.cc:398 >> avd_mds_svc_evt
<143>1 2019-06-11T15:16:48.701791+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40553"] 275:amf/amfd/mds.cc:0 << avd_mds_svc_evt
<143>1 2019-06-11T15:16:48.706254+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40560"] 272:amf/amfd/ndfsm.cc:743 >> avd_mds_avnd_up_evh
<143>1 2019-06-11T15:16:48.706271+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40561"] 275:amf/amfd/ndmsg.cc:389 << avd_n2d_msg_rcv
<143>1 2019-06-11T15:16:48.706288+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40562"] 272:amf/amfd/ndfsm.cc:757 TR amfnd on 2030f is up
<143>1 2019-06-11T15:16:48.706298+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40563"] 272:amf/amfd/ndfsm.cc:0 << avd_mds_avnd_up_evh
<143>1 2019-06-11T15:16:48.707145+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40596"] 272:amf/amfd/ndfsm.cc:275 >> avd_node_up_evh: from
2030f, safAmfNode=PL-3,safAmfCluster=myAmfCluster
<143>1 2019-06-11T15:16:48.707153+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40597"] 272:amf/amfd/ndfsm.cc:292 TR leds_set 0
<143>1 2019-06-11T15:16:48.70716+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40598"] 272:amf/amfd/ndfsm.cc:308 TR node_id '2030f' not in
failover_list.
<141>1 2019-06-11T15:16:48.707185+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40599"] 272:amf/amfd/ndfsm.cc:232 NO Received node_up from
2030f: msg_id 1
<140>1 2019-06-11T15:16:48.7072+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40600"] 272:amf/amfd/ndfsm.cc:387 WA Not a Cluster Member
dropping the msg
<143>1 2019-06-11T15:16:48.707206+10:00 SC-1 osafamfd 272 osafamfd [meta
sequenceId="40601"] 272:amf/amfd/ndfsm.cc:533 << avd_node_up_evh
The patch looks OK, I'm doing a bit more testing and will get back to you.
Thanks
Minh
On 7/5/19 5:41 am, thang.d.nguyen wrote:
When the PBE hung, amfd can process the events with
below order when a node was started then stop then started
- clm_track_cb for node down event
- clm_track_cb for second node up event
- avd_mds_avnd_down_evh was called to process amfnd down event
And it cause the node can not join the cluster.
---
src/amf/amfd/ndfsm.cc | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
index 8c8f3c5..7099196 100644
--- a/src/amf/amfd/ndfsm.cc
+++ b/src/amf/amfd/ndfsm.cc
@@ -800,6 +800,13 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
daemon_exit();
}
+ if (node->node_state == AVD_AVND_STATE_ABSENT) {
+ TRACE("Ignore '%s' amfnd down event since node state absent",
+ node->node_name.c_str());
+ TRACE_LEAVE();
+ return;
+ }
+
if (cb->failover_list.find(evt->info.node_id) != cb->failover_list.end())
{
std::shared_ptr<NodeStateMachine> failed_node =
cb->failover_list.at(evt->info.node_id);
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel