Hi Gary,

My test of remote fencing for PL-2-3 failed, as the following log shows:

Jul  6 10:42:32 SC-2-1 osafamfd[4573]: WA avd_msg_sanity_chk: invalid msg id
45, msg type 8, from 2030f should be 1
Jul  6 10:42:32 SC-2-1 osafamfd[4573]: WA avd_msg_sanity_chk: reboot node
2030f to recover it
Jul  6 10:42:32 SC-2-1 osafamfd[4573]: Rebooting OpenSAF NodeId = 131855 EE
Name = safNode=PL-2-3,safCluster=myClmCluster, Reason: Fencing remote node,
OwnNodeId = 131343, SupervisionTime = 60
Jul  6 10:42:32 SC-2-1 cmwea: tipc-address-get failed, unknown node:
safNode=PL-2-3,safCluster=myClmCluster

I tried to find a way to get "PL-2-3", but nothing in the code seems to do
anything similar.
Moreover, the other places that call opensaf_reboot() use
osaf_extended_name_borrow(&node->node_info.executionEnvironment), but I
checked the code and I think that will be null here.
If I derive "PL-2-3" from nodeName, my code will be quite different from the
other places that call opensaf_reboot().
Lastly, since AMFD can still receive messages from AMFND, why not just use
avd_send_reboot_msg_directly()?
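
For illustration only, a minimal sketch of how the short name could be
derived from the DN seen in the log. FirstRdnValue is a hypothetical helper,
not an existing OpenSAF function, and whether the fencing path actually
expects this short name is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of OpenSAF): returns the value of the first
// RDN of a DN, e.g. "PL-2-3" for "safNode=PL-2-3,safCluster=myClmCluster".
// Returns an empty string if the first component has no "name=value" form.
// Assumption: the value contains no escaped commas ("\,").
std::string FirstRdnValue(const std::string& dn) {
  const std::string::size_type eq = dn.find('=');
  if (eq == std::string::npos) return "";
  const std::string::size_type comma = dn.find(',');
  // A comma before '=' means the first component has no '=' at all.
  if (comma != std::string::npos && comma < eq) return "";
  const std::string::size_type end =
      (comma == std::string::npos) ? dn.size() : comma;
  return dn.substr(eq + 1, end - eq - 1);
}

// Usage check with the DN from the log above.
void CheckFirstRdnValue() {
  assert(FirstRdnValue("safNode=PL-2-3,safCluster=myClmCluster") == "PL-2-3");
  assert(FirstRdnValue("PL-2-3").empty());
}
```

The helper works on a plain std::string, so the caller would first have to
turn the SaNameT into a string, for example via osaf_extended_name_borrow().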

Best Regards,
Thuan

-----Original Message-----
From: thuan.tran <thuan.t...@dektech.com.au> 
Sent: Tuesday, June 26, 2018 1:35 PM
To: hans.nordeb...@ericsson.com; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net; thuan.tran
<thuan.t...@dektech.com.au>
Subject: [PATCH 1/1] amf: Recover node that disconnect from active AMFD
[#2880]

There is an abnormal state in which AMFND on a remote node keeps sending
messages to the active AMFD, but the active AMFD sees that node as having
already left.
The received msg_id does not match the expected one, and the remote node
stays stuck outside the control of the active AMFD.
In this case, the active AMFD can trigger remote fencing for that node if
possible; otherwise it sends a reboot order directly.
---
 src/amf/amfd/ndfsm.cc  |  2 --
 src/amf/amfd/ndproc.cc | 13 +++++++++++++
 2 files changed, 13 insertions(+), 2 deletions(-)
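
A minimal standalone sketch of the recovery rule described in the commit
message, for illustration only. NodeState, Action and ClassifyMessage are
hypothetical names, not AMFD code. The sketch assumes that a message is
expected to carry rcv_msg_id + 1 (as the warning text suggests) and that
rcv_msg_id == 0 means the active AMFD has already marked the node absent.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-node state kept by the active AMFD,
// reduced to the single field this sketch needs.
struct NodeState {
  uint32_t rcv_msg_id;  // last accepted msg id; reset to 0 when the node leaves
};

enum class Action {
  kAccept,        // msg id is the expected one
  kReject,        // msg id mismatch, node still tracked: reject only
  kFence,         // node considered gone, remote fencing enabled
  kDirectReboot,  // node considered gone, remote fencing disabled
};

// Decide what to do with an incoming message, following the rule above.
Action ClassifyMessage(const NodeState& node, uint32_t msg_id,
                       bool remote_fencing_enabled) {
  if (msg_id == node.rcv_msg_id + 1) return Action::kAccept;
  // Mismatch while the node is still tracked: no recovery action.
  if (node.rcv_msg_id != 0) return Action::kReject;
  // Node was marked absent but keeps sending: recover it.
  return remote_fencing_enabled ? Action::kFence : Action::kDirectReboot;
}

// Usage check with the values from the log (msg id 45, expected 1).
void CheckClassifyMessage() {
  assert(ClassifyMessage(NodeState{0}, 45, true) == Action::kFence);
  assert(ClassifyMessage(NodeState{0}, 45, false) == Action::kDirectReboot);
}
```

In the patch itself, every mismatch still returns nullptr; the two recovery
actions map to opensaf_reboot() and avd_send_reboot_msg_directly().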

diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
index 9d54df13d..2d407be12 100644
--- a/src/amf/amfd/ndfsm.cc
+++ b/src/amf/amfd/ndfsm.cc
@@ -796,7 +796,6 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
        */
       node->node_state = AVD_AVND_STATE_ABSENT;
       node->saAmfNodeOperState = SA_AMF_OPERATIONAL_DISABLED;
-      node->adest = 0;
       node->rcv_msg_id = 0;
       node->snd_msg_id = 0;
       node->recvr_fail_sw = false;
@@ -1115,7 +1114,6 @@ void avd_node_mark_absent(AVD_AVND *node) {
 
   LOG_NO("Node '%s' left the cluster", node->node_name.c_str());
 
-  node->adest = 0;
   node->rcv_msg_id = 0;
   node->snd_msg_id = 0;
   node->recvr_fail_sw = false;
diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc
index 428c26085..629b2ddc2 100644
--- a/src/amf/amfd/ndproc.cc
+++ b/src/amf/amfd/ndproc.cc
@@ -73,6 +73,19 @@ AVD_AVND *avd_msg_sanity_chk(AVD_EVT *evt, SaClmNodeIdT node_id,
     LOG_WA("%s: invalid msg id %u, msg type %u, from %x should be %u",
            __FUNCTION__, msg_id, evt->info.avnd_msg->msg_type, node_id,
            node->rcv_msg_id + 1);
+    if (node->rcv_msg_id == 0) {
+      /* The active AMFD sees the node as left, but the node still sees the
+      active AMFD and keeps sending messages with an increasing msg_id */
+      LOG_WA("%s: reboot node %x to recover it", __FUNCTION__, node_id);
+      Consensus consensus_service;
+      if (consensus_service.IsRemoteFencingEnabled() == true) {
+        opensaf_reboot(node_id,
+                      osaf_extended_name_borrow(&node->node_info.nodeName),
+                      "Fencing remote node");
+      } else {
+        avd_send_reboot_msg_directly(node);
+      }
+    }
     return nullptr;
   }
 
--
2.18.0



