Hi Gary,
Thanks for the reminder. My guess was that if there is a node-leave event at
SC-1, there should be a node-leave event at PL-16 as well, and that needs to
be debugged.
The patch looks OK to me too, except that it will send multiple reboot
commands to PL-16 if more messages arrive from PL-16.
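
One way to avoid the repeated reboot commands would be to remember that an order has already been issued for the node. The following is only a minimal standalone sketch, not a patch against amfd: `reboot_once`, `clear_reboot_pending`, `send_reboot_order` and the `g_reboot_pending` set are hypothetical names, and `send_reboot_order` merely stands in for the fencing / direct reboot path used in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical stand-in for the real reboot path (remote fencing or a direct
// reboot message in the patch). Here it only counts how often it is called.
static int g_reboot_orders = 0;
static void send_reboot_order(uint32_t /*node_id*/) { ++g_reboot_orders; }

// Hypothetical bookkeeping: node ids that already have a reboot order pending.
static std::unordered_set<uint32_t> g_reboot_pending;

// Intended to be called for each out-of-sequence message from a node that the
// active AMFD already considers absent. Issues at most one reboot order per
// node until clear_reboot_pending() is called for it.
// Returns true only when an order was actually issued.
bool reboot_once(uint32_t node_id) {
  // insert().second is false when the id was already in the set.
  if (!g_reboot_pending.insert(node_id).second) return false;
  send_reboot_order(node_id);
  return true;
}

// Intended to be called when the node has rejoined, so that a later
// disconnect can trigger a new reboot order.
void clear_reboot_pending(uint32_t node_id) { g_reboot_pending.erase(node_id); }
```

With this, three stale messages in a row from the same node id would result in one reboot order instead of three. In amfd itself the same effect could probably be had with a flag on the node object, but I have not checked where it would need to be reset.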
 
Thanks,
Nagendra, 91-9866424860
www.hasolutions.in
https://www.linkedin.com/company/hasolutions/
High Availability Solutions Pvt. Ltd.
- OpenSAF Support and Services
 
--------- Original Message ---------
Subject: Re: [PATCH 1/1] amf: Recover node that disconnect from active AMFD [#2880]
From: "Gary Lee" <gary....@dektech.com.au>
Date: 7/24/18 10:01 am
To: "thuan.tran" <thuan.t...@dektech.com.au>, hans.nordeb...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net, nagen...@hasolutions.in

Hi Nagu
 
 Do you have any comments on this? It seems OK to me, but I know you've 
 worked on similar scenarios with TIPC flickering before, where reboot is 
 issued from the PL side.
 
 Thanks
 Gary
 
 On 09/07/18 16:37, thuan.tran wrote:
 > There is an abnormal state in which AMFND on a remote node keeps sending
 > messages to the active AMFD, but the active AMFD sees that the node has
 > already left. The expected msg_id does not match, and the remote node
 > stays stuck, out of the active AMFD's control.
 > In this case, the active AMFD can trigger remote fencing for that node
 > if possible; otherwise it sends a reboot order directly.
 > ---
 > src/amf/amfd/ndfsm.cc | 2 --
 > src/amf/amfd/ndproc.cc | 16 ++++++++++++++++
 > 2 files changed, 16 insertions(+), 2 deletions(-)
 >
 > diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
 > index 9d54df13d..2d407be12 100644
 > --- a/src/amf/amfd/ndfsm.cc
 > +++ b/src/amf/amfd/ndfsm.cc
 > @@ -796,7 +796,6 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
 > */
 > node->node_state = AVD_AVND_STATE_ABSENT;
 > node->saAmfNodeOperState = SA_AMF_OPERATIONAL_DISABLED;
 > - node->adest = 0;
 > node->rcv_msg_id = 0;
 > node->snd_msg_id = 0;
 > node->recvr_fail_sw = false;
 > @@ -1115,7 +1114,6 @@ void avd_node_mark_absent(AVD_AVND *node) {
 > 
 > LOG_NO("Node '%s' left the cluster", node->node_name.c_str());
 > 
 > - node->adest = 0;
 > node->rcv_msg_id = 0;
 > node->snd_msg_id = 0;
 > node->recvr_fail_sw = false;
 > diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc
 > index 428c26085..31d2263d2 100644
 > --- a/src/amf/amfd/ndproc.cc
 > +++ b/src/amf/amfd/ndproc.cc
 > @@ -73,6 +73,22 @@ AVD_AVND *avd_msg_sanity_chk(AVD_EVT *evt, SaClmNodeIdT node_id,
 > LOG_WA("%s: invalid msg id %u, msg type %u, from %x should be %u",
 > __FUNCTION__, msg_id, evt->info.avnd_msg->msg_type, node_id,
 > node->rcv_msg_id + 1);
 > + if (node->rcv_msg_id == 0) {
 > + /* Active AMFD see node left but node still see active AMFD
 > + and keep sending messages with msg_id increment */
 > + LOG_WA("%s: reboot node %x to recover it", __FUNCTION__, node_id);
 > + Consensus consensus_service;
 > + if (consensus_service.IsRemoteFencingEnabled() == true) {
 > + std::string host_name =
 > + osaf_extended_name_borrow(&node->node_info.nodeName);
 > + int first = host_name.find_first_of("=") + 1;
 > + int end = host_name.find_first_of(",");
 > + host_name = host_name.substr(first, end-first);
 > + opensaf_reboot(node_id, host_name.c_str(), "Fencing remote node");
 > + } else {
 > + avd_send_reboot_msg_directly(node);
 > + }
 > + }
 > return nullptr;
 > }
 >
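
A small note on the host-name extraction in the ndproc.cc hunk above: it stores the results of find_first_of() in int. Below is a sketch of the same extraction that keeps the indices as std::string::size_type and handles a missing '=' or ',' explicitly. The helper name `first_rdn_value` is hypothetical, and the DN layout `safNode=PL-16,safCluster=myClmCluster` is only an assumed example of what nodeName contains.

```cpp
#include <cassert>
#include <string>

// Returns the value of the first RDN in a DN, e.g. "PL-16" for
// "safNode=PL-16,safCluster=myClmCluster" (assumed example layout).
std::string first_rdn_value(const std::string &dn) {
  const std::string::size_type eq = dn.find('=');
  // No '=' at all: return the input unchanged (a choice made for this sketch).
  if (eq == std::string::npos) return dn;

  const std::string::size_type first = eq + 1;
  const std::string::size_type comma = dn.find(',', first);

  // Without a ',' take everything up to the end of the string.
  const std::string::size_type count =
      (comma == std::string::npos) ? std::string::npos : comma - first;
  return dn.substr(first, count);
}
```

The result could then be passed to opensaf_reboot() in place of the in-line substr() arithmetic.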
