Hi Nagu,

I have tested remote fencing: it works, and the PL is not sent multiple reboots.

Jul 9 14:48:27 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: invalid msg id 45, msg type 8, from 2030f should be 1
Jul 9 14:48:27 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: reboot node 2030f to recover it
Jul 9 14:48:27 SC-2-1 osafamfd[4594]: Rebooting OpenSAF NodeId = 131855 EE Name = PL-2-3, Reason: Fencing remote node, OwnNodeId = 131343, SupervisionTime = 60
Jul 9 14:48:27 SC-2-1 systemd[1]: Starting Session c3 of user root.
Jul 9 14:48:27 SC-2-1 systemd[1]: Started Session c3 of user root.
Jul 9 14:48:27 SC-2-1 external/libvirt[8939]: notice: Domain FT-REG-12-PL-3 was stopped
Jul 9 14:48:30 SC-2-1 external/libvirt[8939]: notice: Domain FT-REG-12-PL-3 was started
Jul 9 14:48:31 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: invalid node ID (2030f)
Jul 9 14:48:31 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: invalid node ID (2030f)

Best Regards,
Thuan

-----Original Message-----
From: nagen...@hasolutions.in <nagen...@hasolutions.in>
Sent: Tuesday, July 24, 2018 1:24 PM
To: opensaf-devel@lists.sourceforge.net
Subject: Re: [devel] [PATCH 1/1] amf: Recover node that disconnnect from active AMFD [#2880]

Hi Gary,

Thanks for reminding me. My guess was that if there is a node leave at SC-1, there should be a node leave at PL-16 as well, and that needs to be debugged.
The patch looks OK to me as well, except that it will send multiple reboot commands to PL-16 if more messages arrive from PL-16.

Thanks,
Nagendra, 91-9866424860
www.hasolutions.in
https://www.linkedin.com/company/hasolutions/
High Availability Solutions Pvt. Ltd.
- OpenSAF Support and Services

--------- Original Message ---------
Subject: Re: [PATCH 1/1] amf: Recover node that disconnnect from active AMFD [#2880]
From: "Gary Lee" <gary....@dektech.com.au>
Date: 7/24/18 10:01 am
To: "thuan.tran" <thuan.t...@dektech.com.au>, hans.nordeb...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net, nagen...@hasolutions.in

Hi Nagu

Do you have any comments on this?
It seems OK to me, but I know you've worked on similar scenarios with TIPC flickering before, where reboot is issued from the PL side.

Thanks
Gary

On 09/07/18 16:37, thuan.tran wrote:
> There is a abnormal state that AMFND on remote node keep sending
> message to active AMFD but active AMFD see that node already left.
> The msg_id expected is not matched and the remote node keep stuck
> as out of control of active AMFD.
> In this case, active AMFD can trigger remote fencing for that node
> if possible, otherwise send reboot order directly.
> ---
>  src/amf/amfd/ndfsm.cc  |  2 --
>  src/amf/amfd/ndproc.cc | 16 ++++++++++++++++
>  2 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
> index 9d54df13d..2d407be12 100644
> --- a/src/amf/amfd/ndfsm.cc
> +++ b/src/amf/amfd/ndfsm.cc
> @@ -796,7 +796,6 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
>       */
>      node->node_state = AVD_AVND_STATE_ABSENT;
>      node->saAmfNodeOperState = SA_AMF_OPERATIONAL_DISABLED;
> -    node->adest = 0;
>      node->rcv_msg_id = 0;
>      node->snd_msg_id = 0;
>      node->recvr_fail_sw = false;
> @@ -1115,7 +1114,6 @@ void avd_node_mark_absent(AVD_AVND *node) {
>
>    LOG_NO("Node '%s' left the cluster", node->node_name.c_str());
>
> -  node->adest = 0;
>    node->rcv_msg_id = 0;
>    node->snd_msg_id = 0;
>    node->recvr_fail_sw = false;
> diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc
> index 428c26085..31d2263d2 100644
> --- a/src/amf/amfd/ndproc.cc
> +++ b/src/amf/amfd/ndproc.cc
> @@ -73,6 +73,22 @@ AVD_AVND *avd_msg_sanity_chk(AVD_EVT *evt, SaClmNodeIdT node_id,
>      LOG_WA("%s: invalid msg id %u, msg type %u, from %x should be %u",
>             __FUNCTION__, msg_id, evt->info.avnd_msg->msg_type, node_id,
>             node->rcv_msg_id + 1);
> +    if (node->rcv_msg_id == 0) {
> +      /* Active AMFD see node left but node still see active AMFD
> +         and keep sending messages with msg_id increment */
> +      LOG_WA("%s: reboot node %x to recover it", __FUNCTION__, node_id);
> +      Consensus consensus_service;
> +      if (consensus_service.IsRemoteFencingEnabled() == true) {
> +        std::string host_name =
> +            osaf_extended_name_borrow(&node->node_info.nodeName);
> +        int first = host_name.find_first_of("=") + 1;
> +        int end = host_name.find_first_of(",");
> +        host_name = host_name.substr(first, end-first);
> +        opensaf_reboot(node_id, host_name.c_str(), "Fencing remote node");
> +      } else {
> +        avd_send_reboot_msg_directly(node);
> +      }
> +    }
>      return nullptr;
>    }
>

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
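
For readers following the host-name handling in the ndproc.cc hunk, here is a minimal standalone sketch of the same idea: take the text between the first '=' and the following ',' of a DN-style node name. The helper name ExtractFirstRdnValue and the sample DN "safNode=PL-2-3,safCluster=myClmCluster" are assumptions for illustration only; they are not taken from the patch or from OpenSAF. Unlike the patch, the sketch uses std::string::size_type and handles a missing '=' or ',' explicitly.

```cpp
#include <string>

// Hypothetical helper (not part of OpenSAF): returns the value of the first
// RDN in a DN-style name, i.e. the text between the first '=' and the next
// ','. Returns an empty string when the name contains no '='.
std::string ExtractFirstRdnValue(const std::string& dn) {
  const std::string::size_type eq = dn.find('=');
  if (eq == std::string::npos) {
    return "";  // not a DN-style name
  }
  const std::string::size_type first = eq + 1;
  const std::string::size_type comma = dn.find(',', first);
  if (comma == std::string::npos) {
    return dn.substr(first);  // single RDN: take everything after '='
  }
  return dn.substr(first, comma - first);
}
```

With the assumed sample DN, ExtractFirstRdnValue("safNode=PL-2-3,safCluster=myClmCluster") yields "PL-2-3", which matches the "EE Name = PL-2-3" shown in the log above.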