Hi Anand,

The node-repair-upgrade patch looks good now, it has passed our tests. Can you please publish the V2 that has all the changes?

thanks

Minh

On 30/7/20 8:26 am, s.an...@gethighavailability.com wrote:
Hi Minh,
Please ignore the second patch. Since recvr_fail_sw was not checkpointed in the 
code, so wrote the code to checkpoint and then everything worked fine.
Please take the first published patch and then apply 
node-repair-upgrade.patch(attached in the ticket #3204) on top of it and then 
do the testing. Please confirm once finished.

The following testing (including upgrade) were performed:
Start 3 nodes (SC-1/SC-2/PL-3). Run appl on PL-3(node autorepair is disabled) 
with recovery as 5.
TC #1: Kill demo on PL-3, it will go for node failover. Now repair the node. 
demo will come up again.
TC #2: Kill demo on PL-3, it will go for node failover. Do controller si-swap. 
Do repair the node. demo will come up again.
TC #3: Kill demo on PL-3, it will go for node failover. Do controller si-swap, 
lock clm node, unlock clm node. Do repair the node. demo will come up again.

Upgrade testing:
1. Start SC-1 without the patch as Active. Start PL-3 with the patch. Start 
SC-2 with the patch.
2. SI-SWAP controller, so that SC-2(with the patch) becomes Act and SC-1 
becomes Standby.
3. No stop SC-1 and start SC-1 with the patch. SC-1 becomes Standby. Now 
perform controller si-swap.
Repeat all the above test cases from TC #1 to TC #3
Everything works fine.

Thanks
Anand Sundararaj
Senior Solutions Architect | +1 480 686 4772
www.GetHighAvailability.com 
(https://am2.myprofessionalmail.com/appsuite/www.GetHighAvailability.com)
Get High Availability Today!
NJ, USA: +1 508-507-6507

On 07/28/2020 3:28 PM minhchau <minh.c...@dektech.com.au> wrote:

Hi Anand,

I have run some other clm tests, it now fails with your latest changes.

clm-adm -o lock safNode=PL-3,safCluster=myClmCluster

clm-adm -o unlock safNode=PL-3,safCluster=myClmCluster

After clm unlock, the amf oper state is still disabled.

/Minh

On 29/7/20 1:38 am, s.an...@gethighavailability.com wrote:
Thanks Minh for your comment and your time to review.

Hi Minh/Thang/Nagendra/Paul,
I am planning to push the patch by 30th July(thursday).
Please kindly find some time to review by 29th July(tomorrow) and
send your comments or Ack.

Thanks
Anand Sundararaj
Senior Solutions Architect | +1 480 686 4772
www.GetHighAvailability.com
Get High Availability Today!
NJ, USA: +1 508-507-6507

On 07/27/2020 8:28 PM minhchau <minh.c...@dektech.com.au> wrote:

Hi,

It works now.

/Minh

On 28/7/20 11:43 am, s.an...@gethighavailability.com wrote:
Hi Minh,
Good catch.
The following patch will solve the issue, please confirm:

diff --git a/src/amf/amfd/clm.cc b/src/amf/amfd/clm.cc
index cfbe36a..743a6c2 100644
--- a/src/amf/amfd/clm.cc
+++ b/src/amf/amfd/clm.cc
@@ -24,10 +24,12 @@
    #include "amf/amfd/node.h"
    #include "base/osaf_time.h"

-static void clm_node_join_complete(AVD_AVND *node) {
-  TRACE_ENTER();
-  /* Enable the node in any case. */
-  avd_node_oper_state_set(node, SA_AMF_OPERATIONAL_ENABLED);
+static void clm_node_join_complete(AVD_AVND *node, bool 
member_transition_state) {
+  TRACE_ENTER2("%u", member_transition_state);
+  /* Don't enable in case, node was up and running and node failover/switchover
+     case was reported i.e. member_transition_state == false */
+  if (member_transition_state == true)
+    avd_node_oper_state_set(node, SA_AMF_OPERATIONAL_ENABLED);

      /* For each of the SUs calculate the readiness state.
       ** call the SG FSM with the new readiness state.
@@ -415,6 +417,12 @@ static void clm_track_cb(
                avd_node_add_nodeid(node);
                TRACE("Node Joined '%s' '%zu'", node_name.c_str(),
                      node_name.length());
+           // To differentiate between CLM lock and track cbk in case of
+            // si-swap
+           bool member_transition_state = false;
+           if (node->node_info.member == SA_FALSE)
+              member_transition_state = true;
+           TRACE("member_transition_state %u", member_transition_state);

                node->node_info.member = SA_TRUE;
                if (avd_cb->avd_peer_ver < AVD_MBCSV_SUB_PART_VERSION_4) {
@@ -427,7 +435,7 @@ static void clm_track_cb(
                if (node->node_state == AVD_AVND_STATE_PRESENT) {
                  TRACE("Node already up and configured");
                  /* now try to instantiate all the SUs that need to be */
-              clm_node_join_complete(node);
+              clm_node_join_complete(node, member_transition_state);
                }
              } else {
                LOG_IN("AMF-node not configured on this CLM-node '%s'",


Thanks

Anand Sundararaj
Senior Solutions Architect | 480 686 4772
www.GetHighAvailability.com

Get High Availability Today!
NJ, USA: +1 508-507-6507

On 07/27/2020 12:29 PM minhchau <minh.c...@dektech.com.au> wrote:

Hi Anand,

I have run the following test:

1. Install 2N amf demo on PL-4 and PL-5

2. Disable saAmfNodeAutoRepair on both PLs

3. Kill amf_demo on PL-5 to escalate Node Failover recovery

4. Run amf-adm repaired <PL-5 dn>

        -> The command works, PL-5 is repaired

However, after step 3, do 2N OpenSAF si-swap, then the 'repaired' does
not work

root@PL-5:~#  amf-adm repaired safAmfNode=PL-5,safAmfCluster=myAmfCluster
error - saImmOmAdminOperationInvoke_2 admin-op RETURNED:
SA_AIS_ERR_NO_OP (28)
error-string: Admin repair request for
'safAmfNode=PL-5,safAmfCluster=myAmfCluster', op state already enabled

Replace si-swap by a SC failover, the command also has the same error.

Thanks

Minh

On 24/7/20 9:04 am, s.an...@gethighavailability.com wrote:
From: Anand Sundararaj <s.an...@gethighavailability.com>

---
     src/amf/amfd/node.cc    | 56 +++++++++++++++++++++++++++++++++++++++++++-
     src/amf/amfd/node.h     |  2 ++
     src/amf/amfd/sgproc.cc  | 18 ++++++++++++++
     src/amf/amfnd/avnd_su.h |  1 +
     src/amf/amfnd/di.cc     | 62 
+++++++++++++++++++++++++++++++++++++++++++++++++
     src/amf/amfnd/err.cc    |  2 +-
     src/amf/amfnd/su.cc     |  2 +-
     7 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/src/amf/amfd/node.cc b/src/amf/amfd/node.cc
index 2c04fca..4c63055 100644
--- a/src/amf/amfd/node.cc
+++ b/src/amf/amfd/node.cc
@@ -108,6 +108,7 @@ void AVD_AVND::initialize() {
       saAmfNodeOperState = SA_AMF_OPERATIONAL_DISABLED;
       admin_node_pend_cbk = {};
       su_cnt_admin_oper = {};
+  su_cnt_admin_repair = {};
       node_state = AVD_AVND_STATE_ABSENT;
       list_of_ncs_su = {};
       list_of_su = {};
@@ -1247,6 +1248,23 @@ void AVD_AVND::node_sus_termstate_set(bool term_state) 
const {
     }
/**
+ * Update count of application SUs of the given node.
+ *
+ * @param node pointer
+ * @return None
+ */
+static uint32_t application_sus_count(AVD_AVND *node) {
+  uint32_t count = 0;
+  /* Count the applications SUs hosted on this node. */
+  for (const auto &su : node->list_of_su) {
+    if (su->su_is_external == false)
+      count++;
+  }
+  TRACE("count '%u'", count);
+  return count;
+}
+
+/**
      * Handle admin operations on SaAmfNode objects.
      *
      * @param immOiHandle
@@ -1532,9 +1550,44 @@ static void node_admin_op_cb(SaImmOiHandleT immOiHandle,
           }
           break;
+ case SA_AMF_ADMIN_REPAIRED:
+      if (node->saAmfNodeOperState == SA_AMF_OPERATIONAL_ENABLED) {
+        report_admin_op_error(
+            immOiHandle, invocation, SA_AIS_ERR_NO_OP, nullptr,
+            "Admin repair request for '%s', op state already enabled",
+            node->name.c_str());
+        goto done;
+      }
+
+      if (node->node_info.member == false) {
+        LOG_NO("'%s' ADMIN_REPAIRED: CLM node is not member",
+            node->name.c_str());
+        avd_saImmOiAdminOperationResult(immOiHandle, invocation, 
SA_AIS_ERR_TRY_AGAIN);
+        goto done;
+      }
+
+      if (0 == application_sus_count(node)) {
+        // Node is configured without appl su, return from here.
+        avd_node_oper_state_set(node, SA_AMF_OPERATIONAL_ENABLED);
+        avd_saImmOiAdminOperationResult(immOiHandle, invocation, SA_AIS_OK);
+        goto done;
+      }
+
+      /* forward the admin op req to the node director */
+      if (avd_admin_op_msg_snd(node->name, AVSV_SA_AMF_NODE, 
SA_AMF_ADMIN_REPAIRED,
+            node) == NCSCC_RC_SUCCESS) {
+        node->admin_node_pend_cbk.admin_oper = SA_AMF_ADMIN_REPAIRED;
+        node->admin_node_pend_cbk.invocation = invocation;
+        node->su_cnt_admin_repair = application_sus_count(node);
+      } else {
+        report_admin_op_error(immOiHandle, invocation, SA_AIS_ERR_TIMEOUT, 
nullptr,
+            "Admin op request send failed '%s'", node->name.c_str());
+      }
+      break;
+
         default:
           report_admin_op_error(immOiHandle, invocation, 
SA_AIS_ERR_NOT_SUPPORTED,
-                            nullptr, "UNSUPPORTED ADMIN OPERATION (%llu)",
+          nullptr, "UNSUPPORTED ADMIN OPERATION (%llu)",
                                 operationId);
           break;
       }
@@ -1591,6 +1644,7 @@ void node_reset_su_try_inst_counter(const AVD_AVND *node) 
{
         su->sg_of_su->try_inst_counter = 0;
       }
     }
+
     /**
      * @brief  Checks all  nodegroup of nodes are in UNLOCKED state.
      * @param  ptr to Node (AVD_AVND).
diff --git a/src/amf/amfd/node.h b/src/amf/amfd/node.h
index 097f54b..37b8ce9 100644
--- a/src/amf/amfd/node.h
+++ b/src/amf/amfd/node.h
@@ -108,6 +108,8 @@ class AVD_AVND {
                                                 callbacks on this node */
       uint32_t su_cnt_admin_oper;             /* count to keep track SUs on 
this node
                                                  undergoing node admin op */
+  uint32_t su_cnt_admin_repair; /* Count to keep track SUs on this node
+                                   being repaired by node admin repair op */
/************ AMF B.04 **************************************************/ diff --git a/src/amf/amfd/sgproc.cc b/src/amf/amfd/sgproc.cc
index 1d7e62a..335d043 100644
--- a/src/amf/amfd/sgproc.cc
+++ b/src/amf/amfd/sgproc.cc
@@ -1025,6 +1025,23 @@ void avd_su_oper_state_evh(AVD_CL_CB *cb, AVD_EVT *evt) {
             }
           }
         } else { /* if(su->sg_of_su->sg_ncs_spec == true) */
+      // This is a case of Node repair
+      if ((n2d_msg->msg_info.n2d_opr_state.node_oper_state ==
+            SA_AMF_OPERATIONAL_ENABLED) &&
+          (node->admin_node_pend_cbk.invocation) &&
+          (node->admin_node_pend_cbk.admin_oper == SA_AMF_ADMIN_REPAIRED)) {
+        avd_node_oper_state_set(node, SA_AMF_OPERATIONAL_ENABLED);
+        if (node->su_cnt_admin_repair > 0) node->su_cnt_admin_repair--;
+        if (node->su_cnt_admin_repair == 0) {
+          /* if this is the last SU then send out the pending callback */
+          avd_saImmOiAdminOperationResult(
+              cb->immOiHandle, node->admin_node_pend_cbk.invocation,
+              SA_AIS_OK);
+          node->admin_node_pend_cbk.invocation = 0;
+          node->admin_node_pend_cbk.admin_oper =
+            static_cast<SaAmfAdminOperationIdT>(0);
+        }
+      }
           /* If oper state of Uninstantiated SU got ENABLED so try to 
instantiate it
              after evaluating SG. */
           if (su->saAmfSUPresenceState == SA_AMF_PRESENCE_UNINSTANTIATED) {
@@ -2326,6 +2343,7 @@ void avd_node_down_mw_susi_failover(AVD_CL_CB *cb, 
AVD_AVND *avnd) {
                               SA_AIS_ERR_REPAIR_PENDING, 
&avnd->admin_node_pend_cbk,
                               "node failure");
         avnd->su_cnt_admin_oper = 0;
+    avnd->su_cnt_admin_repair = 0;
       }
TRACE_LEAVE();
diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h
index 45effd6..46a711f 100644
--- a/src/amf/amfnd/avnd_su.h
+++ b/src/amf/amfnd/avnd_su.h
@@ -461,5 +461,6 @@ extern AVND_SU *avnd_sudb_rec_get_next(AmfDb<std::string, 
AVND_SU> &sudb,
     extern void sudb_rec_comp_add(AVND_SU *su, AVND_COMP *comp, uint32_t *rc);
     uint32_t avnd_evt_avd_compcsi_evh(struct avnd_cb_tag *cb,
                                       struct avnd_evt_tag *evt);
+extern bool comp_in_term_failed_state(void);
#endif // AMF_AMFND_AVND_SU_H_
diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
index 7b01868..5bff121 100644
--- a/src/amf/amfnd/di.cc
+++ b/src/amf/amfnd/di.cc
@@ -92,6 +92,68 @@ static uint32_t avnd_evt_node_admin_op_req(AVND_CB *cb, 
AVND_EVT *evt) {
       cb->rcv_msg_id = info->msg_id;
switch (info->oper_id) {
+    case SA_AMF_ADMIN_REPAIRED: {
+      /* Node has been repaired. Reset states of this node and all DISABLED
+       * SUs/comps. And update AMF director accordingly.
+       * No need to reset states of ENABLED SUs(in case of
+       * SA_AMF_NODE_SWITCHOVER, only faulty components/SU are disabled,
+       * other SUs assignments are simply switched over.)
+       * But for ENABLED Sus, send oper state message, so that, they
+       * are evaluated for assignments.
+       */
+      LOG_NO("Repair request for Node'%s'", Amf::to_string(&info->dn).c_str());
+
+      // Reset this node state first, so that when SU enabled message
+      // reaches at Amfd, it starts giving assignments to those SUs.
+      cb->oper_state = SA_AMF_OPERATIONAL_ENABLED;
+      cb->term_state = AVND_TERM_STATE_UP;
+      cb->node_err_esc_level = AVND_ERR_ESC_LEVEL_0;
+      cb->failed_su = nullptr;
+      cb->su_failover_cnt = 0;
+
+      /* Reset all faulty SUs and Comps */
+      for (AVND_SU *su = avnd_sudb_rec_get_next(cb->sudb, ""); su != nullptr;
+          su = avnd_sudb_rec_get_next(cb->sudb, su->name)) {
+        if (su->is_ncs || su->su_is_external)
+          continue;
+        su->comp_restart_cnt = 0;
+        su->su_restart_cnt = 0;
+        su->su_err_esc_level = AVND_ERR_ESC_LEVEL_0;
+        if (m_AVND_SU_OPER_STATE_IS_ENABLED(su)){
+          rc = avnd_di_oper_send(cb, su, 0);
+          continue;
+        }
+        // Reset the components states.
+        for (AVND_COMP *comp = m_AVND_COMP_FROM_SU_DLL_NODE_GET(
+              m_NCS_DBLIST_FIND_FIRST(&su->comp_list));
+            comp; comp = m_AVND_COMP_FROM_SU_DLL_NODE_GET(
+              m_NCS_DBLIST_FIND_NEXT(&comp->su_dll_node))) {
+          comp->admin_oper = false;
+          m_AVND_COMP_STATE_RESET(comp);
+          avnd_comp_pres_state_set(cb, comp, SA_AMF_PRESENCE_UNINSTANTIATED);
+
+          m_AVND_COMP_OPER_STATE_SET(comp, SA_AMF_OPERATIONAL_ENABLED);
+          avnd_di_uns32_upd_send(AVSV_SA_AMF_COMP, saAmfCompOperState_ID,
+              comp->name, comp->oper);
+        }
+
+        // Reset the SUs states.
+        if ((su->pres == SA_AMF_PRESENCE_TERMINATION_FAILED) &&
+            (comp_in_term_failed_state() == false))
+          avnd_failed_state_file_delete();
+
+        su->admin_op_Id = static_cast<SaAmfAdminOperationIdT>(0);
+        reset_suRestart_flag(su);
+        m_AVND_SU_STATE_RESET(su);
+        m_AVND_SU_OPER_STATE_SET(su, SA_AMF_OPERATIONAL_ENABLED);
+        avnd_di_uns32_upd_send(AVSV_SA_AMF_SU, saAmfSUOperState_ID, su->name,
+            su->oper);
+        avnd_su_pres_state_set(cb, su, SA_AMF_PRESENCE_UNINSTANTIATED);
+        rc = avnd_di_oper_send(cb, su, 0);
+      } // End of for loop for su
+      break;
+    } // End of case SA_AMF_ADMIN_REPAIRED:
+
         default:
           LOG_NO("%s: unsupported adm op %u", __FUNCTION__, info->oper_id);
           rc = NCSCC_RC_FAILURE;
diff --git a/src/amf/amfnd/err.cc b/src/amf/amfnd/err.cc
index aeec373..c549047 100644
--- a/src/amf/amfnd/err.cc
+++ b/src/amf/amfnd/err.cc
@@ -1135,7 +1135,7 @@ uint32_t avnd_err_rcvr_node_failover(AVND_CB *cb, AVND_SU 
*failed_su,
       for (comp = avnd_compdb_rec_get_next(cb->compdb, ""); comp != nullptr;
            comp = avnd_compdb_rec_get_next(cb->compdb, comp->name)) {
         if (comp->su->is_ncs || comp->su->su_is_external) continue;
-
+    rc = avnd_comp_curr_info_del(cb, comp);
         rc = avnd_comp_clc_fsm_run(cb, comp, 
AVND_COMP_CLC_PRES_FSM_EV_CLEANUP);
         if (rc != NCSCC_RC_SUCCESS) {
           LOG_ER("'%s' termination failed", comp->name.c_str());
diff --git a/src/amf/amfnd/su.cc b/src/amf/amfnd/su.cc
index d8e388a..cb4928d 100644
--- a/src/amf/amfnd/su.cc
+++ b/src/amf/amfnd/su.cc
@@ -639,7 +639,7 @@ done:
       return rc;
     }
-static bool comp_in_term_failed_state(void) {
+bool comp_in_term_failed_state(void) {
       for (AVND_COMP *comp = avnd_compdb_rec_get_next(avnd_cb->compdb, "");
            comp != nullptr;
            comp = avnd_compdb_rec_get_next(avnd_cb->compdb, comp->name)) {


_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to