Hi Minh,
I think we should also look at this problem from a fault management
perspective. Here the failed component is repaired before recovery
completes. In this problem, the component faulted with comp-failover
recovery and was successfully repaired (instantiated) while the SU
switch-over was still pending.
Now the question is: why was this never observed earlier? The reason is
that components are generally assigned at least one CSI. In the present
configuration the failed component was not assigned any CSI. When this
component was cleaned up and marked UNINSTANTIATED, AMFND sent a
comp-failover recovery request to AMFD. But after sending the recovery
request, it instantiated the failed comp while the SU still had
assignments to be switched over. The related code assumes that a comp
will have at least one CSI assigned to it (clc.cc,
avnd_comp_clc_st_chng_prc(), the TERMINATING to UNINSTANTIATED if
block). In the normal comp-failover sequence, the SU is repaired after
removal of the assignment, in avnd_su_si_oper_done(), by calling
avnd_err_su_repair().
For 2N and N+M, the spec (3.11.1.3.2 Fail-Over Recovery Action, page
195) talks about switching over all the SIs of the failed SU in case of
comp-failover recovery, but not for the other models. The current
OpenSAF implementation follows this for all models.
I think the fix should prevent the failed comp from being instantiated
before the assignments are removed. For this, the check in clc.cc can be
hardened to also cover failures of non-assigned comps.
Attached is the patch (2233_v2.patch) based on this idea/approach.
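The hardened check can be sketched roughly as below. This is only a
simplified model, not the OpenSAF code: Su, Comp and
may_reinstantiate_now() are hypothetical stand-ins for AVND_SU,
AVND_COMP and the condition in avnd_comp_clc_st_chng_prc(), and the
shutdown, surestart and sufailover checks of the real code are left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-ins for AVND_SU and AVND_COMP.
struct Su {
  std::size_t si_count = 0;        // SIs still assigned (si_list.n_nodes)
  bool admin_terminating = false;  // SU is being terminated by admin op
};

struct Comp {
  const Su* su = nullptr;
  bool failed = false;
  std::size_t csi_count = 0;       // CSIs assigned (csi_list.n_nodes)
};

// True when a cleaned-up failed component may be instantiated right away.
// The si_count test is the added condition: a comp without CSIs must still
// wait until its SU has no assignments left.
bool may_reinstantiate_now(const Comp& comp, bool node_enabled) {
  assert(comp.su != nullptr);
  return comp.failed && comp.csi_count == 0 &&
         !comp.su->admin_terminating &&
         comp.su->si_count == 0 &&
         node_enabled;
}
```

With si_count > 0 the function returns false, so in this model the repair
is left to the normal path after removal of assignments.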
Thanks,
Praveen
On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote:
Hi Praveen,
Yes, you are right, I will update the description.
Thanks, Minh
Quoting praveen malviya <praveen.malv...@oracle.com>:
Hi Minh,
One quick question:
Ticket description says:
"Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 depends
safSi=AmfDemoTwon"
But the logs are from a run without SI dep. Also, in the configuration
app3_twon3su3si.xml, the SI dep classes are commented out.
I think the ticket description needs correction, as the problem occurs
without SI dep.
Please confirm.
Thanks,
Praveen
On 17-Feb-17 10:58 AM, praveen malviya wrote:
Hi Minh,
I have started reviewing this patch.
Thanks,
Praveen
On 15-Feb-17 9:22 AM, minh chau wrote:
Hi all,
Have you had time to review this patch?
It changes the component failover sequence, so I think we need more
time to look at it.
Thanks,
Minh
On 23/01/17 12:28, Minh Hon Chau wrote:
src/amf/amfnd/avnd_su.h | 1 +
src/amf/amfnd/clc.cc | 3 ---
src/amf/amfnd/di.cc | 12 +++++++++++-
src/amf/amfnd/susm.cc | 32 +++++++++++++++++++++++++++++---
4 files changed, 41 insertions(+), 7 deletions(-)
In case of component failover, the faulty component is terminated. When
the reinstantiation is done, amfnd sends su_oper_message (enabled) to
amfd, and this runs concurrently with the component failover. In the
reported problem, if su_oper_message (enabled) reaches amfd before the
quiesced assignment response (part of the component failover sequence),
the quiesced assignment response is ignored and the component failover
never finishes.
The problem is in function susi_success_sg_realign with act=5, state=3:
amfd always assumes that the su with the faulty component is
OUT_OF_SERVICE. This assumption holds most of the time, when
su_oper_message (enabled) arrives a little later than the quiesced
assignment response. In fact, su_oper_message (enabled) is not designed
as part of the component failover sequence, so it can arrive at any time
during the failover. If amfd is a bit busier with RTA updates, the
faulty component has enough time to reinstantiate, amfnd sends
su_oper_message (enabled) before the quiesced assignment response, and
the reported problem is seen.
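The ordering dependence can be illustrated with a toy model. It is only
a sketch under assumptions: Event, DirectorView and failover_completes()
are hypothetical names that do not exist in amfd, and the single
su_in_service flag stands in for the readiness state that
susi_success_sg_realign relies on.

```cpp
#include <vector>

// Toy model only: these names are hypothetical and do not exist in amfd.
enum class Event { CompFailoverRequest, QuiescedResponse, SuOperEnabled };

struct DirectorView {
  bool su_in_service = true;   // what the director believes about the SU
  bool failover_done = false;  // set once the quiesced response is acted on

  void handle(Event ev) {
    switch (ev) {
      case Event::CompFailoverRequest:
        su_in_service = false;  // faulty SU is treated as OUT_OF_SERVICE
        break;
      case Event::SuOperEnabled:
        su_in_service = true;   // repair reported, SU looks healthy again
        break;
      case Event::QuiescedResponse:
        // The handler assumes the SU is still OUT_OF_SERVICE; otherwise
        // the response is dropped and the failover never completes.
        if (!su_in_service) failover_done = true;
        break;
    }
  }
};

// Replays a message sequence and reports whether failover completed.
bool failover_completes(const std::vector<Event>& order) {
  DirectorView d;
  for (Event ev : order) d.handle(ev);
  return d.failover_done;
}
```

In this model the sequence request, quiesced response, oper enabled
completes, while request, oper enabled, quiesced response does not.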
This patch hardens the component failover sequence by ensuring that
su_oper_message (enabled) is sent only after the su has finished
removing its assignments. This approach mirrors su failover, where
su_oper_message (enabled) is sent in the repair phase.
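The deferral can be sketched in isolation as below. This is a simplified
model, not the patch itself: SuState and try_send_oper_enabled() are
hypothetical, with the flags standing in for
componentfailover_in_progress(), all_csis_in_removed_state() and the SU
oper state, and the counter standing in for avnd_di_oper_send().

```cpp
// Hypothetical, simplified node-side view of one SU.
struct SuState {
  bool comp_failover_in_progress = false;
  bool assignments_removed = false;  // all CSIs in removed state
  bool oper_enabled = false;         // SU oper state is ENABLED
  int oper_msgs_sent = 0;            // stands in for avnd_di_oper_send()
};

// Intended to be called from both trigger points: when the SU becomes
// enabled after reinstantiation, and when the assignment response is sent.
// Returns true if su_oper_message (enabled) was sent on this call.
bool try_send_oper_enabled(SuState& su) {
  if (!su.oper_enabled) return false;
  // During component failover, hold the message until assignments are gone.
  if (su.comp_failover_in_progress && !su.assignments_removed) return false;
  ++su.oper_msgs_sent;
  su.comp_failover_in_progress = false;  // failover flag reset after send
  return true;
}
```

Whichever of the two events happens last is the one that actually sends
the message, so the director always sees the assignment removal first.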
diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h
--- a/src/amf/amfnd/avnd_su.h
+++ b/src/amf/amfnd/avnd_su.h
@@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis
extern struct avnd_su_si_rec *avnd_silist_getprev(const struct avnd_su_si_rec *);
extern struct avnd_su_si_rec *avnd_silist_getlast(void);
extern bool sufailover_in_progress(const AVND_SU *su);
+extern bool componentfailover_in_progress(const AVND_SU *su);
extern bool sufailover_during_nodeswitchover(const AVND_SU *su);
extern bool all_csis_in_removed_state(const AVND_SU *su);
extern void su_reset_restart_count_in_comps(const struct avnd_cb_tag *cb, const AVND_SU *su);
diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
--- a/src/amf/amfnd/clc.cc
+++ b/src/amf/amfnd/clc.cc
@@ -2381,9 +2381,6 @@ uint32_t avnd_comp_clc_terming_cleansucc
(m_AVND_SU_IS_FAILOVER(su))) {
/* yes, request director to orchestrate component failover */
rc = avnd_di_oper_send(cb, su, SA_AMF_COMPONENT_FAILOVER);
-
- //Reset component-failover here. SU failover is reset as part of REPAIRED admin op.
- m_AVND_SU_FAILOVER_RESET(su);
}
/*
diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
--- a/src/amf/amfnd/di.cc
+++ b/src/amf/amfnd/di.cc
@@ -894,7 +894,17 @@ uint32_t avnd_di_susi_resp_send(AVND_CB
}
m_AVND_SU_ALL_SI_RESET(su);
}
-
+ if (componentfailover_in_progress(su)) {
+ if (all_csis_in_removed_state(su) == true) {
+ bool is_en;
+ m_AVND_SU_IS_ENABLED(su, is_en);
+ if (is_en) {
+ if (avnd_di_oper_send(cb, su, 0) == NCSCC_RC_SUCCESS) {
+ m_AVND_SU_FAILOVER_RESET(su);
+ }
+ }
+ }
+ }
/* free the contents of avnd message */
avnd_msg_content_free(cb, &msg);
diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc
--- a/src/amf/amfnd/susm.cc
+++ b/src/amf/amfnd/susm.cc
@@ -1633,10 +1633,22 @@ uint32_t avnd_su_pres_st_chng_prc(AVND_C
m_AVND_SU_IS_ENABLED(su, is_en);
if (true == is_en) {
TRACE("SU oper state is enabled");
+ // do not send su_oper state if component failover is in progress
m_AVND_SU_OPER_STATE_SET(su, SA_AMF_OPERATIONAL_ENABLED);
- rc = avnd_di_oper_send(cb, su, 0);
- if (NCSCC_RC_SUCCESS != rc)
- goto done;
+ if (componentfailover_in_progress(su) == true) {
+ si = reinterpret_cast<AVND_SU_SI_REC*>
+ (m_NCS_DBLIST_FIND_FIRST(&su->si_list));
+ if (si == nullptr || all_csis_in_removed_state(su)) {
+ rc = avnd_di_oper_send(cb, su, 0);
+ if (rc != NCSCC_RC_SUCCESS)
+ goto done;
+ m_AVND_SU_FAILOVER_RESET(su);
+ }
+ } else {
+ rc = avnd_di_oper_send(cb, su, 0);
+ if (NCSCC_RC_SUCCESS != rc)
+ goto done;
+ }
}
else
TRACE("SU oper state is disabled");
@@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const AVND_S
}
/**
+ * This function checks if the componentfailover is going on.
+ * @param su: ptr to the SU .
+ *
+ * @return true/false.
+ */
+bool componentfailover_in_progress(const AVND_SU *su) {
+ if ((su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su)) &&
+ (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) && (!su->is_ncs) &&
+ m_AVND_SU_IS_FAILOVER(su))
+ return true;
+ return false;
+}
+
+/**
* This function checks if the sufailover and node switchover are going on.
* @param su: ptr to the SU .
*
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
--- a/src/amf/amfnd/clc.cc
+++ b/src/amf/amfnd/clc.cc
@@ -1115,8 +1115,9 @@ uint32_t avnd_comp_clc_st_chng_prc(AVND_
* su termination, so we need not instantiate the comp, just reset
* the failed flag.
*/
+ TRACE("comp->su->si_list.n_nodes:%u",comp->su->si_list.n_nodes);
if (m_AVND_COMP_IS_FAILED(comp) && !comp->csi_list.n_nodes &&
- !m_AVND_SU_IS_ADMN_TERM(comp->su) &&
+ !m_AVND_SU_IS_ADMN_TERM(comp->su) && (comp->su->si_list.n_nodes == 0) &&
(cb->oper_state == SA_AMF_OPERATIONAL_ENABLED)) {
/* No need to restart component during shutdown, during surestart
and during sufailover.It will be instantiated as part of repair.
@@ -1125,7 +1126,8 @@ uint32_t avnd_comp_clc_st_chng_prc(AVND_
if (!m_AVND_IS_SHUTTING_DOWN(cb) &&
!sufailover_in_progress(comp->su) &&
(!m_AVND_SU_IS_RESTART(comp->su)))
rc = avnd_comp_clc_fsm_trigger(cb, comp, AVND_COMP_CLC_PRES_FSM_EV_INST);
- } else if (m_AVND_COMP_IS_FAILED(comp) && !comp->csi_list.n_nodes) {
+ } else if (m_AVND_COMP_IS_FAILED(comp) && !comp->csi_list.n_nodes &&
+ (comp->su->si_list.n_nodes == 0)) {
m_AVND_COMP_FAILED_RESET(comp); /*if we moved from restart -> term
due to admn operation */
}