Hi Nagu,

I have tested remote fencing. It works, and the PL does not receive multiple reboots.

Jul  9 14:48:27 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: invalid msg id 45, msg type 8, from 2030f should be 1
Jul  9 14:48:27 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: reboot node 2030f to recover it
Jul  9 14:48:27 SC-2-1 osafamfd[4594]: Rebooting OpenSAF NodeId = 131855 EE Name = PL-2-3, Reason: Fencing remote node, OwnNodeId = 131343, SupervisionTime = 60
Jul  9 14:48:27 SC-2-1 systemd[1]: Starting Session c3 of user root.
Jul  9 14:48:27 SC-2-1 systemd[1]: Started Session c3 of user root.
Jul  9 14:48:27 SC-2-1 external/libvirt[8939]: notice: Domain FT-REG-12-PL-3 was stopped
Jul  9 14:48:30 SC-2-1 external/libvirt[8939]: notice: Domain FT-REG-12-PL-3 was started
Jul  9 14:48:31 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: invalid node ID (2030f)
Jul  9 14:48:31 SC-2-1 osafamfd[4594]: WA avd_msg_sanity_chk: invalid node ID (2030f)
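
A note for reading the log: the sanity-check lines print the node ID in hex (2030f), while the reboot line prints it in decimal (131855). The small sketch below confirms they refer to the same node. The helper name is only for illustration and is not an OpenSAF function.

```cpp
#include <cstdint>
#include <string>

// Illustrative helper (not part of OpenSAF): convert a node ID printed with
// "%x" in the AMFD log into its numeric value, as printed with "%u".
static uint32_t NodeIdFromHex(const std::string& hex) {
  return static_cast<uint32_t>(std::stoul(hex, nullptr, 16));
}
```

With this, NodeIdFromHex("2030f") gives 131855, the NodeId of PL-2-3 in the reboot line, and NodeIdFromHex("2010f") gives 131343, the OwnNodeId of SC-2-1.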


Best Regards,
Thuan

-----Original Message-----
From: nagen...@hasolutions.in <nagen...@hasolutions.in> 
Sent: Tuesday, July 24, 2018 1:24 PM
To: opensaf-devel@lists.sourceforge.net
Subject: Re: [devel] [PATCH 1/1] amf: Recover node that disconnnect from
active AMFD [#2880]

Hi Gary,
Thanks for reminding me. My guess was that if there is a node leave at SC-1,
then there should be a node leave at PL-16 as well, and that needs to be debugged.
The patch looks OK to me as well, except that it will result in multiple
reboot commands being sent to PL-16 if more messages arrive from PL-16.
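
One possible way to avoid repeated reboot orders, purely as a sketch: remember which nodes already have a reboot pending and skip the order for them until the node rejoins. The class and method names below are hypothetical and do not exist in AMFD. Where such a guard would be called from is left open.

```cpp
#include <cstdint>
#include <set>

// Hypothetical helper, not existing OpenSAF code: tracks nodes for which a
// reboot has already been ordered, so later messages from the same node do
// not trigger another one.
class RebootGuard {
 public:
  // Returns true only the first time it is called for node_id
  // (until Clear() is called for that node).
  bool ShouldReboot(uint32_t node_id) {
    return pending_.insert(node_id).second;
  }

  // Call once the node has rejoined, so a later fault can reboot it again.
  void Clear(uint32_t node_id) { pending_.erase(node_id); }

 private:
  std::set<uint32_t> pending_;
};
```

In the sanity-check path, the reboot or fencing call would then be wrapped in `if (guard.ShouldReboot(node_id)) { ... }`, assuming a single guard instance owned by the active AMFD.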
 
Thanks,
Nagendra, 91-9866424860
www.hasolutions.in
https://www.linkedin.com/company/hasolutions/
High Availability Solutions Pvt. Ltd.
- OpenSAF Support and Services
 
--------- Original Message ---------
Subject: Re: [PATCH 1/1] amf: Recover node that disconnnect from active AMFD [#2880]
From: "Gary Lee" <gary....@dektech.com.au>
Date: 7/24/18 10:01 am
To: "thuan.tran" <thuan.t...@dektech.com.au>, hans.nordeb...@ericsson.com
Cc: opensaf-devel@lists.sourceforge.net, nagen...@hasolutions.in

Hi Nagu
 
 Do you have any comments on this? It seems OK to me, but I know you've
worked on similar scenarios with TIPC flickering before, where reboot is
issued from the PL side.
 
 Thanks
 Gary
 
 On 09/07/18 16:37, thuan.tran wrote:
 > There is a abnormal state that AMFND on remote node keep sending
 > message to active AMFD but active AMFD see that node already left.
 > The msg_id expected is not matched and the remote node keep stuck
 > as out of control of active AMFD.
 > In this case, active AMFD can trigger remote fencing for that node
 > if possible, otherwise send reboot order directly.
 > ---
 > src/amf/amfd/ndfsm.cc | 2 --
 > src/amf/amfd/ndproc.cc | 16 ++++++++++++++++
 > 2 files changed, 16 insertions(+), 2 deletions(-)
 >
 > diff --git a/src/amf/amfd/ndfsm.cc b/src/amf/amfd/ndfsm.cc
 > index 9d54df13d..2d407be12 100644
 > --- a/src/amf/amfd/ndfsm.cc
 > +++ b/src/amf/amfd/ndfsm.cc
 > @@ -796,7 +796,6 @@ void avd_mds_avnd_down_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
 >  */
 >  node->node_state = AVD_AVND_STATE_ABSENT;
 >  node->saAmfNodeOperState = SA_AMF_OPERATIONAL_DISABLED;
 > - node->adest = 0;
 >  node->rcv_msg_id = 0;
 >  node->snd_msg_id = 0;
 >  node->recvr_fail_sw = false;
 > @@ -1115,7 +1114,6 @@ void avd_node_mark_absent(AVD_AVND *node) {
 >
 >  LOG_NO("Node '%s' left the cluster", node->node_name.c_str());
 >
 > - node->adest = 0;
 >  node->rcv_msg_id = 0;
 >  node->snd_msg_id = 0;
 >  node->recvr_fail_sw = false;
 > diff --git a/src/amf/amfd/ndproc.cc b/src/amf/amfd/ndproc.cc
 > index 428c26085..31d2263d2 100644
 > --- a/src/amf/amfd/ndproc.cc
 > +++ b/src/amf/amfd/ndproc.cc
 > @@ -73,6 +73,22 @@ AVD_AVND *avd_msg_sanity_chk(AVD_EVT *evt, SaClmNodeIdT node_id,
 >  LOG_WA("%s: invalid msg id %u, msg type %u, from %x should be %u",
 >  __FUNCTION__, msg_id, evt->info.avnd_msg->msg_type, node_id,
 >  node->rcv_msg_id + 1);
 > + if (node->rcv_msg_id == 0) {
 > +   /* Active AMFD see node left but node still see active AMFD
 > +   and keep sending messages with msg_id increment */
 > +   LOG_WA("%s: reboot node %x to recover it", __FUNCTION__, node_id);
 > +   Consensus consensus_service;
 > +   if (consensus_service.IsRemoteFencingEnabled() == true) {
 > +     std::string host_name =
 > +       osaf_extended_name_borrow(&node->node_info.nodeName);
 > +     int first = host_name.find_first_of("=") + 1;
 > +     int end = host_name.find_first_of(",");
 > +     host_name = host_name.substr(first, end-first);
 > +     opensaf_reboot(node_id, host_name.c_str(), "Fencing remote node");
 > +   } else {
 > +     avd_send_reboot_msg_directly(node);
 > +   }
 > + }
 >  return nullptr;
 >  }
 >
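
The host name extraction in the patch can be sketched in isolation as follows. This assumes the node name is a DN whose first RDN carries the host name, for example `safNode=PL-2-3,safCluster=myClmCluster`. That DN is an assumed example, not taken from the thread. The helper name is hypothetical. Unlike the patch, the sketch uses `std::string::size_type` and handles the not-found cases explicitly instead of relying on `npos` arithmetic in `int`.

```cpp
#include <string>

// Illustrative helper (hypothetical name, not an OpenSAF function): return
// the value of the first RDN in a DN, e.g.
//   "safNode=PL-2-3,safCluster=myClmCluster" -> "PL-2-3"
// Returns an empty string when the input contains no '='.
static std::string FirstRdnValue(const std::string& dn) {
  const std::string::size_type eq = dn.find('=');
  if (eq == std::string::npos) return std::string();

  const std::string::size_type first = eq + 1;
  const std::string::size_type comma = dn.find(',', first);

  // No comma means the value runs to the end of the string.
  if (comma == std::string::npos) return dn.substr(first);
  return dn.substr(first, comma - first);
}
```

Returning an empty string when no '=' is present is a design choice for this sketch. The patch as posted would instead pass the whole name through in that case.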
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most engaging
tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

