Hi Thuan,
 
If more than one node-up event is pending in the mailbox, it may send
more than one reboot message.
Anyway, ack from me.
 
 
Thanks,
Nagendra, 91-9866424860
www.hasolutions.in
https://www.linkedin.com/company/hasolutions/
High Availability Solutions Pvt. Ltd.
- OpenSAF Support and Services
 
--------- Original Message ---------
Subject: RE: [devel] [PATCH 1/1] amf: Recover node that disconnnect from active AMFD [#2880]
From: "Tran Thuan" <thuan.t...@dektech.com.au>
Date: 7/24/18 1:42 pm
To: nagen...@hasolutions.in, opensaf-devel@lists.sourceforge.net

Hi Nagu,
 
 I have tested remote fencing. It works, and no multiple reboots are sent to the PL.
 
 Jul 9 14:48:27 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: invalid msg id 45, msg type 8, from 2030f should be 1
 Jul 9 14:48:27 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: reboot node 2030f to recover it
 Jul 9 14:48:27 SC-2-1 osafamfd[4594]: Rebooting OpenSAF NodeId = 131855 EE Name = PL-2-3, Reason: Fencing remote node, OwnNodeId = 131343, SupervisionTime = 60
 Jul 9 14:48:27 SC-2-1 systemd[1]: Starting Session c3 of user root.
 Jul 9 14:48:27 SC-2-1 systemd[1]: Started Session c3 of user root.
 Jul 9 14:48:27 SC-2-1 external/libvirt[8939]: notice: Domain FT-REG-12-PL-3 was stopped
 Jul 9 14:48:30 SC-2-1 external/libvirt[8939]: notice: Domain FT-REG-12-PL-3 was started
 Jul 9 14:48:31 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: invalid node ID (2030f)
 Jul 9 14:48:31 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: invalid node ID (2030f)
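
 A note for anyone matching these lines up: avd_msg_sanity_chk prints the
 node ID with %x (2030f), while the "Rebooting OpenSAF" line prints it in
 decimal (131855). Both are the same node. A minimal standalone check,
 not OpenSAF code (NodeIdToHex is a hypothetical helper used only here):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper, not part of OpenSAF: formats a node ID the same way
// the "%x" format specifier in the LOG_WA lines does.
std::string NodeIdToHex(uint32_t node_id) {
  char buf[16];
  std::snprintf(buf, sizeof(buf), "%x", node_id);
  return std::string(buf);
}
```

 NodeIdToHex(131855) gives "2030f" (the rebooted node, PL-2-3) and
 NodeIdToHex(131343) gives "2010f" (the OwnNodeId in the same log line).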
 
 
 Best Regards,
 Thuan
 
 -----Original Message-----
 From: nagen...@hasolutions.in <nagen...@hasolutions.in> 
 Sent: Tuesday, July 24, 2018 1:24 PM
 To: opensaf-devel@lists.sourceforge.net
 Subject: Re: [devel] [PATCH 1/1] amf: Recover node that disconnnect from
 active AMFD [#2880]
 
 Hi Gary,
 Thanks for reminding me. My guess was that if there is a node leave at SC-1,
 there should be a node leave at PL-16 as well, and that needs to be debugged.
 The patch looks OK to me as well, except that it will result in sending
 multiple reboot commands to PL-16 if there are more messages from PL-16.
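
 One possible way to avoid the repeated reboot commands would be to remember
 that recovery has already been requested for a node and skip further
 requests until that node rejoins. This is only a rough sketch of the idea:
 RebootGuard and its methods are hypothetical and are not part of the patch
 or of AMFD.

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical guard, not part of the patch: tracks the nodes for which a
// reboot or fencing order has already been issued.
class RebootGuard {
 public:
  // Returns true only the first time it is called for node_id since the
  // last Clear(node_id). The caller would send the reboot order only then.
  bool ShouldReboot(uint32_t node_id) {
    return pending_.insert(node_id).second;
  }

  // Intended to be called once the node has rejoined (for example on node
  // up), so that a later disconnect can be recovered again.
  void Clear(uint32_t node_id) { pending_.erase(node_id); }

 private:
  std::unordered_set<uint32_t> pending_;
};
```

 With such a guard, a second out-of-sequence message from the same node
 would be logged but would not trigger another reboot order.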
 
 Thanks,
 Nagendra, 91-9866424860
 www.hasolutions.in
 https://www.linkedin.com/company/hasolutions/
 High Availability Solutions Pvt. Ltd.
 - OpenSAF Support and Services
 
 --------- Original Message ---------
 Subject: Re: [PATCH 1/1] amf: Recover node that disconnnect from active AMFD [#2880]
 From: "Gary Lee" <gary....@dektech.com.au>
 Date: 7/24/18 10:01 am
 To: "thuan.tran" <thuan.t...@dektech.com.au>, hans.nordeb...@ericsson.com
 Cc: opensaf-devel@lists.sourceforge.net, nagen...@hasolutions.in
 
 Hi Nagu
 
 Do you have any comments on this? It seems OK to me, but I know you've
 worked on similar scenarios with TIPC flickering before, where reboot is
 issued from the PL side.
 
 Thanks
 Gary
 
 On 09/07/18 16:37, thuan.tran wrote:
 > There is an abnormal state in which AMFND on a remote node keeps sending
 > messages to the active AMFD, but the active AMFD sees that node as already left.
 > The expected msg_id does not match, and the remote node remains stuck,
 > out of the control of the active AMFD.
 > In this case, the active AMFD can trigger remote fencing for that node
 > if possible; otherwise it sends the reboot order directly.
 > ---
 >  src/amf/amfd/ndfsm.cc  |  2 --
 >  src/amf/amfd/ndproc.cc | 16 ++++++++++++++++
 >  2 files changed, 16 insertions(+), 2 deletions(-)
 >
 > diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
 > index 9d54df13d..2d407be12 100644
 > --- a/src/amf/amfd/ndfsm.cc
 > +++ b/src/amf/amfd/ndfsm.cc
 > @@ -796,7 +796,6 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
 >       */
 >      node->node_state = AVD_AVND_STATE_ABSENT;
 >      node->saAmfNodeOperState = SA_AMF_OPERATIONAL_DISABLED;
 > -    node->adest = 0;
 >      node->rcv_msg_id = 0;
 >      node->snd_msg_id = 0;
 >      node->recvr_fail_sw = false;
 > @@ -1115,7 +1114,6 @@ void avd_node_mark_absent(AVD_AVND *node) {
 >
 >    LOG_NO("Node '%s' left the cluster", node->node_name.c_str());
 >
 > -  node->adest = 0;
 >    node->rcv_msg_id = 0;
 >    node->snd_msg_id = 0;
 >    node->recvr_fail_sw = false;
 > diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc
 > index 428c26085..31d2263d2 100644
 > --- a/src/amf/amfd/ndproc.cc
 > +++ b/src/amf/amfd/ndproc.cc
 > @@ -73,6 +73,22 @@ AVD_AVND *avd_msg_sanity_chk(AVD_EVT *evt, SaClmNodeIdT node_id,
 >      LOG_WA("%s: invalid msg id %u, msg type %u, from %x should be %u",
 >             __FUNCTION__, msg_id, evt->info.avnd_msg->msg_type, node_id,
 >             node->rcv_msg_id + 1);
 > +    if (node->rcv_msg_id == 0) {
 > +      /* Active AMFD see node left but node still see active AMFD
 > +         and keep sending messages with msg_id increment */
 > +      LOG_WA("%s: reboot node %x to recover it", __FUNCTION__, node_id);
 > +      Consensus consensus_service;
 > +      if (consensus_service.IsRemoteFencingEnabled() == true) {
 > +        std::string host_name =
 > +            osaf_extended_name_borrow(&node->node_info.nodeName);
 > +        int first = host_name.find_first_of("=") + 1;
 > +        int end = host_name.find_first_of(",");
 > +        host_name = host_name.substr(first, end-first);
 > +        opensaf_reboot(node_id, host_name.c_str(), "Fencing remote node");
 > +      } else {
 > +        avd_send_reboot_msg_directly(node);
 > +      }
 > +    }
 >      return nullptr;
 >    }
 >
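
 The host name extraction in the patch takes the text between the first '='
 and the first ',' of the node name. Below is a standalone sketch of the same
 parsing with explicit handling for names that lack either character.
 ExtractHostName is a hypothetical helper that is not in the patch, and the
 DN in the comment is only an assumed example of the format. Unlike the
 patch, it searches for ',' starting after the '=' and keeps the positions
 in std::string::size_type instead of int.

```cpp
#include <string>

// Hypothetical helper, not in the patch: returns the value of the first
// RDN, e.g. "PL-2-3" for "safNode=PL-2-3,safCluster=myClmCluster"
// (an assumed example DN).
std::string ExtractHostName(const std::string& node_dn) {
  const std::string::size_type eq = node_dn.find('=');
  if (eq == std::string::npos) {
    return node_dn;  // no '=': use the name unchanged
  }
  const std::string::size_type first = eq + 1;
  const std::string::size_type comma = node_dn.find(',', first);
  if (comma == std::string::npos) {
    return node_dn.substr(first);  // no ',' after '=': keep the remainder
  }
  return node_dn.substr(first, comma - first);
}
```

 For the assumed DN above this returns "PL-2-3", which matches the EE Name
 seen in the fencing log line earlier in the thread.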
 ----------------------------------------------------------------------------
 --
 Check out the vibrant tech community on one of the world's most engaging
 tech sites, Slashdot.org! http://sdm.link/slashdot
 _______________________________________________
 Opensaf-devel mailing list
 Opensaf-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/opensaf-devel
