Hi Nagu
Do you have any comments on this? It seems OK to me, but I know you've
worked on similar scenarios with TIPC flickering before, where reboot is
issued from the PL side.
Thanks
Gary
On 09/07/18 16:37, thuan.tran wrote:
There is an abnormal state in which AMFND on a remote node keeps sending
messages to the active AMFD, but the active AMFD sees that node as having
already left. The expected msg_id does not match, and the remote node
remains stuck, out of the active AMFD's control.
In this case, the active AMFD can trigger remote fencing for that node
if possible; otherwise it sends a reboot order directly.
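The condition described above can be modelled in isolation. The following is only a minimal sketch, not the AMFD code: NodeModel, Action and check_msg_id are hypothetical names introduced for illustration, and the sketch assumes that a message is accepted only when its msg_id equals rcv_msg_id + 1 and that rcv_msg_id is reset to 0 when the node is marked absent.

```cpp
#include <cstdint>

// Hypothetical stand-in for the per-node state kept by the active AMFD.
// It is not the real AVD_AVND structure.
struct NodeModel {
  // Last message id accepted from the node. Assumed to be reset to 0
  // when the node is marked absent.
  uint32_t rcv_msg_id;
};

// Hypothetical outcome of the sanity check.
enum class Action {
  kAccept,       // msg_id is the next expected id
  kDrop,         // mismatch on a node that is still tracked
  kRecoverNode,  // mismatch on a node already marked absent: fence or reboot
};

// Decide what to do with an incoming message id.
Action check_msg_id(const NodeModel &node, uint32_t msg_id) {
  if (msg_id == node.rcv_msg_id + 1) {
    return Action::kAccept;
  }
  // The counter was reset, yet the node keeps counting upwards: the node
  // still believes it is connected while the director thinks it has left.
  if (node.rcv_msg_id == 0) {
    return Action::kRecoverNode;
  }
  return Action::kDrop;
}
```

In this model a node that was marked absent (rcv_msg_id == 0) and then sends msg_id 57 ends up in the recovery branch, while a freshly joined node sending msg_id 1 is accepted as usual.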
---
src/amf/amfd/ndfsm.cc | 2 --
src/amf/amfd/ndproc.cc | 16 ++++++++++++++++
2 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
index 9d54df13d..2d407be12 100644
--- a/src/amf/amfd/ndfsm.cc
+++ b/src/amf/amfd/ndfsm.cc
@@ -796,7 +796,6 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
*/
node->node_state = AVD_AVND_STATE_ABSENT;
node->saAmfNodeOperState = SA_AMF_OPERATIONAL_DISABLED;
- node->adest = 0;
node->rcv_msg_id = 0;
node->snd_msg_id = 0;
node->recvr_fail_sw = false;
@@ -1115,7 +1114,6 @@ void avd_node_mark_absent(AVD_AVND *node) {
LOG_NO("Node '%s' left the cluster", node->node_name.c_str());
- node->adest = 0;
node->rcv_msg_id = 0;
node->snd_msg_id = 0;
node->recvr_fail_sw = false;
diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc
index 428c26085..31d2263d2 100644
--- a/src/amf/amfd/ndproc.cc
+++ b/src/amf/amfd/ndproc.cc
@@ -73,6 +73,22 @@ AVD_AVND *avd_msg_sanity_chk(AVD_EVT *evt, SaClmNodeIdT node_id,
LOG_WA("%s: invalid msg id %u, msg type %u, from %x should be %u",
__FUNCTION__, msg_id, evt->info.avnd_msg->msg_type, node_id,
node->rcv_msg_id + 1);
+ if (node->rcv_msg_id == 0) {
+ /* The active AMFD sees the node as having left, but the node still sees
+ the active AMFD and keeps sending messages with incrementing msg_id */
+ LOG_WA("%s: reboot node %x to recover it", __FUNCTION__, node_id);
+ Consensus consensus_service;
+ if (consensus_service.IsRemoteFencingEnabled() == true) {
+ std::string host_name =
+ osaf_extended_name_borrow(&node->node_info.nodeName);
+ int first = host_name.find_first_of("=") + 1;
+ int end = host_name.find_first_of(",");
+ host_name = host_name.substr(first, end-first);
+ opensaf_reboot(node_id, host_name.c_str(), "Fencing remote node");
+ } else {
+ avd_send_reboot_msg_directly(node);
+ }
+ }
return nullptr;
}
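The host-name extraction in the hunk above stores the results of find_first_of() in int variables, so a missing "=" or "," is handled only through implicit conversion of std::string::npos. A standalone sketch with explicit npos handling might look as follows. The helper name first_rdn_value is hypothetical and not part of the patch, and the DN form "safNode=PL-3,safCluster=myClmCluster" is an assumed example of a node name.

```cpp
#include <string>

// Hypothetical helper, not part of the patch: return the value of the
// first RDN of a DN, e.g. "PL-3" for the assumed example
// "safNode=PL-3,safCluster=myClmCluster".
std::string first_rdn_value(const std::string &dn) {
  const std::string::size_type eq = dn.find('=');
  if (eq == std::string::npos) {
    // No '=' at all: treat the whole string as the host name.
    return dn;
  }
  const std::string::size_type first = eq + 1;
  // Look for the separator only after the '=' that was found.
  const std::string::size_type comma = dn.find(',', first);
  if (comma == std::string::npos) {
    // Single RDN: take everything after '='.
    return dn.substr(first);
  }
  return dn.substr(first, comma - first);
}
```

The result could then be passed on in place of host_name, for example opensaf_reboot(node_id, first_rdn_value(dn).c_str(), "Fencing remote node"), where dn holds the borrowed node name.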
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel