Hi Alex, The patch looks good to me. Ack.

Thanks
-Nagendra, +91-9866424860
High Availability Solutions (OpenSAF Support and Services)
www.hasolutions.in
cont...@hasolutions.in
Delaware, USA: +1 508-422-7725 | Hyderabad, India: +91 798-992-5293

From: Jones, Alex [mailto:ajo...@rbbn.com]
Sent: 08 May 2019 01:17
To: hans.nordeb...@ericsson.com; nagen...@hasolutions.in; gary....@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net; Jones, Alex
Subject: [PATCH 1/1] amfnd: don't attempt su failover if active controller is rebooting [#3035]

In N+M model CSI-remove responses can get lost if active controller
reboots. In this case SG will be stuck in unstable state, and standby
will never get assignments.

We are the active controller, active for N+M, SU failover is set, and
failfast on termination failure is set for the nodes. If a component in
the SU crashes, and another component fails during cleanup, the node
does failfast. It currently attempts to do su failover in this case, but
the csi-remove responses from the payload can get lost because we are
rebooting. They eventually show up on the new active, but we get
message-id errors.

Set a flag when the active controller is about to reboot. If the flag is
set, then don't do SU failover. Let the new active take care of the
failover.
---
 src/amf/amfd/node.cc   | 1 +
 src/amf/amfd/node.h    | 1 +
 src/amf/amfd/sgproc.cc | 7 +++++++
 src/amf/amfd/util.cc   | 3 +++
 4 files changed, 12 insertions(+)

diff --git a/src/amf/amfd/node.cc b/src/amf/amfd/node.cc
index 7fc764f22..b8d8a7d77 100644
--- a/src/amf/amfd/node.cc
+++ b/src/amf/amfd/node.cc
@@ -121,6 +121,7 @@ void AVD_AVND::initialize() {
   clm_pend_inv = {};
   clm_change_start_preceded = {};
   recvr_fail_sw = {};
+  actv_ctrl_reboot_in_progress = {};
   admin_ng = {};
 }
 
diff --git a/src/amf/amfd/node.h b/src/amf/amfd/node.h
index ecee5c591..dbe48dc43 100644
--- a/src/amf/amfd/node.h
+++ b/src/amf/amfd/node.h
@@ -140,6 +140,7 @@ class AVD_AVND {
                                      CLM completed cb. */
   bool recvr_fail_sw; /* to indicate there was node reboot because of node
                          failover/switchover.*/
+  bool actv_ctrl_reboot_in_progress;
   AVD_AMF_NG *admin_ng; /* points to the nodegroup on which admin operation
                            is going on.*/
   uint16_t node_up_msg_count; /* to count of node_up msg that director had
diff --git a/src/amf/amfd/sgproc.cc b/src/amf/amfd/sgproc.cc
index 1537acac3..7c8d9a558 100644
--- a/src/amf/amfd/sgproc.cc
+++ b/src/amf/amfd/sgproc.cc
@@ -478,6 +478,13 @@ static uint32_t sg_su_failover_func(AVD_SU *su) {
     goto done;
   }
 
+  if (su->su_on_node->actv_ctrl_reboot_in_progress) {
+    TRACE("'%s' is already going down, so not doing SU failover",
+          su->name.c_str());
+    rc = NCSCC_RC_SUCCESS;
+    goto done;
+  }
+
   su->set_oper_state(SA_AMF_OPERATIONAL_DISABLED);
   su->set_readiness_state(SA_AMF_READINESS_OUT_OF_SERVICE);
   if (su->saAmfSUAdminState == SA_AMF_ADMIN_LOCKED)
diff --git a/src/amf/amfd/util.cc b/src/amf/amfd/util.cc
index 14a4e0485..0dc3e99e3 100644
--- a/src/amf/amfd/util.cc
+++ b/src/amf/amfd/util.cc
@@ -1802,6 +1802,9 @@ void avd_d2n_reboot_snd(AVD_AVND *node) {
   if (avd_d2n_msg_snd(avd_cb, node, d2n_msg) != NCSCC_RC_SUCCESS) {
     LOG_ER("%s: snd to %x failed", __FUNCTION__, node->node_info.nodeId);
     d2n_msg_free(d2n_msg);
+  } else if (node->node_info.nodeId == avd_cb->node_id_avd) {
+    TRACE("rebooting active amf director which is ourself");
+    node->actv_ctrl_reboot_in_progress = true;
   }
 }
 
-- 
2.17.2
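
For readers following the thread, the way the util.cc and sgproc.cc hunks cooperate can be sketched in isolation. This is only a minimal model under stated assumptions: `Node`, `Su`, `reboot_snd` and `su_failover` are hypothetical stand-ins, not the real amfd classes or functions, and the bool return value (true if failover was started) is a simplification of the `rc`/`goto done` handling in the patch.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the amfd types; not the real OpenSAF classes.
struct Node {
  uint32_t node_id = 0;
  bool actv_ctrl_reboot_in_progress = false;  // flag added by the patch
};

struct Su {
  std::string name;
  Node *su_on_node = nullptr;
  bool failover_started = false;  // illustration only
};

// Models the util.cc hunk: once a reboot order has been sent successfully
// to the node hosting the active director itself, remember that fact.
inline void reboot_snd(Node &node, uint32_t own_node_id, bool send_ok) {
  if (!send_ok) return;  // the real code logs an error and frees the message
  if (node.node_id == own_node_id) node.actv_ctrl_reboot_in_progress = true;
}

// Models the sgproc.cc hunk: skip SU failover while the flag is set and
// leave the work to the new active director.
// Returns true if failover was started, false if it was skipped.
inline bool su_failover(Su &su) {
  if (su.su_on_node->actv_ctrl_reboot_in_progress) return false;
  su.failover_started = true;
  return true;
}
```

In this model, a reboot order sent to a payload node leaves its flag clear, so SU failover for SUs on that node proceeds as before. Only a successful reboot order addressed to the active director's own node suppresses the failover.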