Hi Minh,

I think we should see this problem from fault management perspective also. Here repair of failed component is performed before the completion of recovery.In the problem, component faulted with comp-failover recovery and it was successfully repaired(instantiated) when SU switch-over was still pending.

Now the question is: Why it was never observed earlier? The reason is generally all components are assigned at least one CSI. In the present configuration failed component was not assigned any CSI. When this component was cleaned up and marked UNINSTANTIATED, AMFND sent comp-failover recovery request to AMFD. But after sending recovery request, it instantiated failed comp when SU has still assignments to be switch-overed. The code related to this assumes that comp will have at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(), TERMINATING to UNINSTANTIATED if block). For normal sequence of comp-failover, su is repaired after removal of assignment in avnd_su_si_oper_done() by calling avnd_err_su_repair().

For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page 195) about switch-overing all the SIs of failed SU in case of comp-failed recovery and not for other models. In current OpenSAF implementation we are following this for all models.

I think as a fix we should stop failed comp to get instantiated before removal of assignments. For this the check in clc.cc can be hardened to consider non-assigned comp failures.
Attached is the patch (2233_v2.patch) based on this idea/approach.

Thanks,
Praveen


On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote:
Hi Praveen,

Yes, you are right, I will update the description.

Thanks, Minh

Quoting praveen malviya <praveen.malv...@oracle.com>:

Hi Minh,

One quick question:
Ticket description says:
"Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 depends
safSi=AmfDemoTwon"
But logs are related to without SIdep. Also in the configuration
app3_twon3su3si.xml, SI dep classes are commented.
I think ticket description needs correction as problem is without SI dep.
Please confirm.

Thanks,
Praveen


On 17-Feb-17 10:58 AM, praveen malviya wrote:
Hi Minh,

I have started reviewing this patch.

Thanks,
Praveen

On 15-Feb-17 9:22 AM, minh chau wrote:
Hi all,

Have you had time to review this patch?
It changes the component failover sequence, so I think we need more
time
to look at it.

Thanks,
Minh

On 23/01/17 12:28, Minh Hon Chau wrote:
 src/amf/amfnd/avnd_su.h |   1 +
 src/amf/amfnd/clc.cc    |   3 ---
 src/amf/amfnd/di.cc     |  12 +++++++++++-
 src/amf/amfnd/susm.cc   |  32 +++++++++++++++++++++++++++++---
 4 files changed, 41 insertions(+), 7 deletions(-)


In case component failover, faulty component will be terminated. When
the reinstantiation
is done, amfnd will send su_oper_message (enabled) to amfd which is
running along with
component failover. In the reported problem, if su_oper_message
(enabled) comes to amfd
before the quiesced assignment response (as part of component failover
sequence) comes to
amfd, then this quiesced assignment response is ignored, thus
component failover will not
finish.

The problem is in function susi_success_sg_realign with act=5,
state=3, amfd always assumes
su having faulty component is OUT_OF_SERVICE. This assumption is true
in most of the time
when su_oper_message (enabled) comes a little later than quiesced
assignment response. In fact
the su_oper_message (enabled) is not designed as part of component
failover sequence, thus it
can come any time during the failover. If amfd is getting a bit busier
with RTA update then
the faulty component has enough to reinstiantiate so that amfnd sends
su_oper_message (enabled)
before quiesced assignment response, the reported problem will be
seen.

This patch hardens the component failover sequence by ensuring the
su_oper_message (enabled) to
be sent after su completes to remove assignment. This approach comes
from the similarity in
su failover, where the su_oper_message (enabled) is sent in repair
phase.

diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h
--- a/src/amf/amfnd/avnd_su.h
+++ b/src/amf/amfnd/avnd_su.h
@@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis
 extern struct avnd_su_si_rec *avnd_silist_getprev(const struct
avnd_su_si_rec *);
 extern struct avnd_su_si_rec *avnd_silist_getlast(void);
 extern bool sufailover_in_progress(const AVND_SU *su);
+extern bool componentfailover_in_progress(const AVND_SU *su);
 extern bool sufailover_during_nodeswitchover(const AVND_SU *su);
 extern bool all_csis_in_removed_state(const AVND_SU *su);
 extern void su_reset_restart_count_in_comps(const struct avnd_cb_tag
*cb, const AVND_SU *su);
diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
--- a/src/amf/amfnd/clc.cc
+++ b/src/amf/amfnd/clc.cc
@@ -2381,9 +2381,6 @@ uint32_t avnd_comp_clc_terming_cleansucc
             (m_AVND_SU_IS_FAILOVER(su))) {
         /* yes, request director to orchestrate component failover */
         rc = avnd_di_oper_send(cb, su, SA_AMF_COMPONENT_FAILOVER);
-
-        //Reset component-failover here. SU failover is reset as part
of REPAIRED admin op.
-        m_AVND_SU_FAILOVER_RESET(su);
     }
       /*
diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
--- a/src/amf/amfnd/di.cc
+++ b/src/amf/amfnd/di.cc
@@ -894,7 +894,17 @@ uint32_t avnd_di_susi_resp_send(AVND_CB
             }
             m_AVND_SU_ALL_SI_RESET(su);
         }
-
+        if (componentfailover_in_progress(su)) {
+            if (all_csis_in_removed_state(su) == true) {
+                bool is_en;
+                m_AVND_SU_IS_ENABLED(su, is_en);
+                if (is_en) {
+                    if (avnd_di_oper_send(cb, su, 0) ==
NCSCC_RC_SUCCESS) {
+                        m_AVND_SU_FAILOVER_RESET(su);
+                    }
+                }
+            }
+        }
     /* free the contents of avnd message */
     avnd_msg_content_free(cb, &msg);
 diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc
--- a/src/amf/amfnd/susm.cc
+++ b/src/amf/amfnd/susm.cc
@@ -1633,10 +1633,22 @@ uint32_t avnd_su_pres_st_chng_prc(AVND_C
             m_AVND_SU_IS_ENABLED(su, is_en);
             if (true == is_en) {
                 TRACE("SU oper state is enabled");
+                // do not send su_oper state if component failover is
in progress
                 m_AVND_SU_OPER_STATE_SET(su,
SA_AMF_OPERATIONAL_ENABLED);
-                rc = avnd_di_oper_send(cb, su, 0);
-                if (NCSCC_RC_SUCCESS != rc)
-                    goto done;
+                if (componentfailover_in_progress(su) == true) {
+                    si = reinterpret_cast<AVND_SU_SI_REC*>
+                            (m_NCS_DBLIST_FIND_FIRST(&su->si_list));
+                    if (si == nullptr ||
all_csis_in_removed_state(su)) {
+                        rc = avnd_di_oper_send(cb, su, 0);
+                        if (rc != NCSCC_RC_SUCCESS)
+                            goto done;
+                        m_AVND_SU_FAILOVER_RESET(su);
+                    }
+                } else {
+                    rc = avnd_di_oper_send(cb, su, 0);
+                    if (NCSCC_RC_SUCCESS != rc)
+                        goto done;
+                }
             }
             else
                 TRACE("SU oper state is disabled");
@@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const AVND_S
 }
   /**
+ * This function checks if the componentfailover is going on.
+ * @param su: ptr to the SU .
+ *
+ * @return true/false.
+ */
+bool componentfailover_in_progress(const AVND_SU *su) {
+    if ((su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su)) &&
+            (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) &&
(!su->is_ncs) &&
+            m_AVND_SU_IS_FAILOVER(su))
+        return true;
+    return false;
+}
+
+/**
  * This function checks if the sufailover and node switchover are
going on.
  * @param su: ptr to the SU .
  *



------------------------------------------------------------------------------

Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel



diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
--- a/src/amf/amfnd/clc.cc
+++ b/src/amf/amfnd/clc.cc
@@ -1115,8 +1115,9 @@ uint32_t avnd_comp_clc_st_chng_prc(AVND_
                         * su termination, so we need not instantiate the comp, 
just reset
                         * the failed flag.
                         */
+                       
TRACE("comp->su->si_list.n_nodes:%u",comp->su->si_list.n_nodes);
                        if (m_AVND_COMP_IS_FAILED(comp) && 
!comp->csi_list.n_nodes &&
-                           !m_AVND_SU_IS_ADMN_TERM(comp->su) &&
+                           !m_AVND_SU_IS_ADMN_TERM(comp->su) && 
(comp->su->si_list.n_nodes == 0) &&
                            (cb->oper_state == SA_AMF_OPERATIONAL_ENABLED)) {
                                /* No need to restart component during 
shutdown, during surestart
                                   and during sufailover.It will be 
instantiated as part of repair.
@@ -1125,7 +1126,8 @@ uint32_t avnd_comp_clc_st_chng_prc(AVND_
                                if (!m_AVND_IS_SHUTTING_DOWN(cb) && 
!sufailover_in_progress(comp->su) &&
                                                
(!m_AVND_SU_IS_RESTART(comp->su)))
                                        rc = avnd_comp_clc_fsm_trigger(cb, 
comp, AVND_COMP_CLC_PRES_FSM_EV_INST);
-                       } else if (m_AVND_COMP_IS_FAILED(comp) && 
!comp->csi_list.n_nodes) {
+                       } else if (m_AVND_COMP_IS_FAILED(comp) && 
!comp->csi_list.n_nodes &&
+                                       (comp->su->si_list.n_nodes == 0)) {
                                m_AVND_COMP_FAILED_RESET(comp); /*if we moved 
from restart -> term
                                                                                
                due to admn operation */
                        }
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to