Hi, Thanks Gary. @Nagu, Praveen: Have you had time to check the example in my previous email? The ticket #2179 is about to document that full escalation is supported for SC absence feature, it is waiting for fix of #2233. I think there's not big change in code for #2233, it's a matter of decision to make for re-instantiation of failed component.
Thanks, Minh On 01/03/17 15:42, Gary Lee wrote: > Hi > > It seems the component should be re-instantiated if it has no CSI. Whether or > not there is an SI assigned should be irrelevant? > > Thanks > Gary > > -----Original Message----- > From: minh chau <minh.c...@dektech.com.au> > Date: Thursday, 23 February 2017 at 3:16 pm > To: Nagendra Kumar <nagendr...@oracle.com>, Praveen Malviya > <praveen.malv...@oracle.com> > Cc: <hans.nordeb...@ericsson.com>, gary <gary....@dektech.com.au>, > <long.hb.ngu...@dektech.com.au>, <opensaf-devel@lists.sourceforge.net> > Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message > synchronizes with component failover sequence [#2233] > > Hi Nagu, Praveen, > > Please find my comment in [Minh3] > > Thanks, > Minh > > On 22/02/17 19:34, Nagendra Kumar wrote: > >>> Since in spec there is no specific discussion for comp-failover > recovery for an unassigned comp, I will encourage other maintainers also to > provide inputs. > > I do agree for not instantiating failed component before recovery, > this keeps the approach similar to SU failover also. > [Minh3]: There's one example of component failover that I would like us > to have a look > - 2N application, SU4/SU5 has active/standby assignment respectively, > each SU has 3 components > - Add a sleep of 10 seconds in clc script start command of first > component C41 of SU4 > Steps: > 1- Kill C41 to trigger component failover > 2- SU4 goes for quiesced assignment > 3- SU5 goes for active assignment > 4- SU4 is removed its assignment > 5- Now there's a pause of 10 seconds due to clc script start, to ensure > that C41 is healthy > 6- Next SU4 has standby assignment. > > From the above example, I think we can see some problems if the > re-instantiation of C41 is delayed: > - Because C41 is faulty, it needs to be restarted ok because its SU has > assignment > - Moving re-instantiation of C41 is further down that means the recovery > will take longer > - What if re-instantiation of C41 leads to instantation-failed > > Whether or not the C41 has assignment or is unassigned, the > OperState/PresenceState result from re-instantiation of faulty C41 > affects to SU4's eligibility for assignment. > There's a parallelism between [restart of faulty component C41] and > [movement from Active->Quiesced->Removed assignment of SU4], it's good > to have and it's current behavior of amfnd. > It's my understanding so far but I could be wrong. Let's check with Hans > and Gary. > > > > > @Minh: If you don't mind, we can take su oper state changes in an > enhancement. What do you say ? > [Minh3]: Enhancement is ok, but I hope we can have this fix in 5.2 > release. > > > > Thanks > > -Nagu > > > >> -----Original Message----- > >> From: praveen malviya > >> Sent: 21 February 2017 11:15 > >> To: minh chau > >> Cc: hans.nordeb...@ericsson.com; Nagendra Kumar; > >> gary....@dektech.com.au; long.hb.ngu...@dektech.com.au; opensaf- > >> de...@lists.sourceforge.net > >> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational > message > >> synchronizes with component failover sequence [#2233] > >> > >> Hi Minh, > >> > >> Please find my response inline with [Praveen]. > >> > >> Thanks, > >> Praveen > >> > >> On 20-Feb-17 6:58 AM, minh chau wrote: > >>> Hi Praveen, > >>> > >>> Thanks for your V2 patch, I have tested V2 in scenario of ticket > #2233 > >>> and #1902, it also can fix the problem. > >>> Here we have 2 solutions: > >>> - The one I sent for review is letting the failed component to be > >>> instantiated, I think it is current behavior. But one change is that > >>> amfnd will not report su operational message to amfd until amfnd > >>> finishes removing the assignment of (faulty) su which contains the > >>> failed component > >>> - The V2 patch postpones the instantiation of failed component. amfnd > >>> will instantiate the failed component (via avnd_err_su_repair) after > >>> amfnd finishes removing the assignment of faulty su. > >>> > >>> So basically the difference is the time that the failed component > should > >>> be instantiated. > >>> > >>> Still in item 3.11.1.3.2: > >>> "In a 2N or N+M redundancy model, SI2 also needs to be switched over; > >>> other-wise, the number of active service units would be higher than > what > >>> is allowed by the redundancy model. However, in an Nway redundancy > >>> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover > >>> configuration attribute of the ser-vice unit is set to SA_FALSE), > and a > >>> repair of C2 should be attempted by reinstantiating it. If the > attempt > >>> to instantiate C2 fails, the service unit becomes disabled, and SI2 > must > >>> be switched-over; however, if the attempt to instantiate C2 is > >>> successful, SI2 shall remain assigned to SU1, and based on other > >>> configuration parameters and N-way redundancy model semantics, even > >> SI1 > >>> might get reassigned to SU1." > >>> > >>> My comment on V2: > >>> > >>> The configuration in #2233 is different from the example in > >>> specification, but it sounds to me the attempt to instantiate failed > >>> component should be done as soon as possible. > >>> The check in V2 patch means the failed component won't be > instantiated > >>> if its SU still has any assignment. It should be true to 2N and N+M, > but > >>> it's not for other SG. (As the example in specification, S2 does not > >>> have any CSI assigned to failed component C2). > >> [Praveen]As of now we have documented in the PR doc (conformance table > >> section 3.11.1.3 Recovery) that if a component faults with > comp-failover > >> recovery then AMFD switch-overs the whole SU for N-Way, N-Way Active > >> and > >> N+M models also. This is just to highlight about other red models. But > >> this documentation is not clear for an unassigned comp. > >> But from the beginning, comp-failover is working this way only. > At-least > >> from clean up perspective we have fixed the problem of parallelism in > >> the past in the ticket #474. > >> > >> One more thing I have noted, proxy-proxied implementation is based on > >> B.01.01. As per B.01.01, proxy will register himself and its proxied > as > >> soon as it gets instantiated. In a configuration containing both proxy > >> and proxied comp, if the proxy does not get any CSI and it faults with > >> comp-failover recovery then in instantiation phase it may again > register > >> its proxy. I think proxy in other SU should register its proxied. I > >> guess, from deployment perspective such a configuration in which a > user > >> configures proxy without any CSI may not exists and only possibility > is > >> an application modeling a legacy code in NoRed model. However, in the > >> later version of spec B.01.02, proxied was supposed to mention the > name > >> of proxy CSI and thus proxy should register only when its get proxy > CSI. > >> > >> One more point to be noted comp-failover can also be done as a part of > >> escalation also. If a component is instantiated before the completion > of > >> comp-failover recovery and if faults again then it may escalate to > >> node-failover before completion of comp-failover recovery. > >> > >> Since in spec there is no specific discussion for comp-failover > recovery > >> for an unassigned comp, I will encourage other maintainers also to > >> provide inputs. > >> > >> > >> Thanks, > >> Praveen > >> > >> > >> Moreover, in the clc.cc, > >>> amfnd does not check any of si_list.n_nodes, this probably is the > logic > >>> that has being done so far. > >>> > >>> Thanks, > >>> Minh > >>> > >>> On 17/02/17 23:16, praveen malviya wrote: > >>>> Hi Minh, > >>>> > >>>> I think we should see this problem from fault management perspective > >>>> also. Here repair of failed component is performed before the > >>>> completion of recovery.In the problem, component faulted with > >>>> comp-failover recovery and it was successfully > repaired(instantiated) > >>>> when SU switch-over was still pending. > >>>> > >>>> Now the question is: Why it was never observed earlier? The reason > is > >>>> generally all components are assigned at least one CSI. In the > present > >>>> configuration failed component was not assigned any CSI. When this > >>>> component was cleaned up and marked UNINSTANTIATED, AMFND sent > >>>> comp-failover recovery request to AMFD. But after sending recovery > >>>> request, it instantiated failed comp when SU has still assignments > to > >>>> be switch-overed. The code related to this assumes that comp will > have > >>>> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(), > >>>> TERMINATING to UNINSTANTIATED if block). For normal sequence of > >>>> comp-failover, su is repaired after removal of assignment in > >>>> avnd_su_si_oper_done() by calling avnd_err_su_repair(). > >>>> > >>>> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page > >>>> 195) about switch-overing all the SIs of failed SU in case of > >>>> comp-failed recovery and not for other models. In current OpenSAF > >>>> implementation we are following this for all models. > >>>> > >>>> I think as a fix we should stop failed comp to get instantiated > before > >>>> removal of assignments. For this the check in clc.cc can be hardened > >>>> to consider non-assigned comp failures. > >>>> Attached is the patch (2233_v2.patch) based on this idea/approach. > >>>> > >>>> Thanks, > >>>> Praveen > >>>> > >>>> > >>>> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote: > >>>>> Hi Praveen, > >>>>> > >>>>> Yes, you are right, I will update the description. > >>>>> > >>>>> Thanks, Minh > >>>>> > >>>>> Quoting praveen malviya <praveen.malv...@oracle.com>: > >>>>> > >>>>>> Hi Minh, > >>>>>> > >>>>>> One quick question: > >>>>>> Ticket description says: > >>>>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 > >> depends > >>>>>> safSi=AmfDemoTwon" > >>>>>> But logs are related to without SIdep. Also in the configuration > >>>>>> app3_twon3su3si.xml, SI dep classes are commented. > >>>>>> I think ticket description needs correction as problem is without > SI > >>>>>> dep. > >>>>>> Please confirm. > >>>>>> > >>>>>> Thanks, > >>>>>> Praveen > >>>>>> > >>>>>> > >>>>>> On 17-Feb-17 10:58 AM, praveen malviya wrote: > >>>>>>> Hi Minh, > >>>>>>> > >>>>>>> I have started reviewing this patch. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Praveen > >>>>>>> > >>>>>>> On 15-Feb-17 9:22 AM, minh chau wrote: > >>>>>>>> Hi all, > >>>>>>>> > >>>>>>>> Have you had time to review this patch? > >>>>>>>> It changes the component failover sequence, so I think we need > >> more > >>>>>>>> time > >>>>>>>> to look at it. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> Minh > >>>>>>>> > >>>>>>>> On 23/01/17 12:28, Minh Hon Chau wrote: > >>>>>>>>> src/amf/amfnd/avnd_su.h | 1 + > >>>>>>>>> src/amf/amfnd/clc.cc | 3 --- > >>>>>>>>> src/amf/amfnd/di.cc | 12 +++++++++++- > >>>>>>>>> src/amf/amfnd/susm.cc | 32 > >> +++++++++++++++++++++++++++++--- > >>>>>>>>> 4 files changed, 41 insertions(+), 7 deletions(-) > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> In case component failover, faulty component will be > terminated. > >>>>>>>>> When > >>>>>>>>> the reinstantiation > >>>>>>>>> is done, amfnd will send su_oper_message (enabled) to amfd > which > >> is > >>>>>>>>> running along with > >>>>>>>>> component failover. In the reported problem, if su_oper_message > >>>>>>>>> (enabled) comes to amfd > >>>>>>>>> before the quiesced assignment response (as part of component > >>>>>>>>> failover > >>>>>>>>> sequence) comes to > >>>>>>>>> amfd, then this quiesced assignment response is ignored, thus > >>>>>>>>> component failover will not > >>>>>>>>> finish. > >>>>>>>>> > >>>>>>>>> The problem is in function susi_success_sg_realign with act=5, > >>>>>>>>> state=3, amfd always assumes > >>>>>>>>> su having faulty component is OUT_OF_SERVICE. This assumption > is > >>>>>>>>> true > >>>>>>>>> in most of the time > >>>>>>>>> when su_oper_message (enabled) comes a little later than > quiesced > >>>>>>>>> assignment response. In fact > >>>>>>>>> the su_oper_message (enabled) is not designed as part of > >> component > >>>>>>>>> failover sequence, thus it > >>>>>>>>> can come any time during the failover. If amfd is getting a bit > >>>>>>>>> busier > >>>>>>>>> with RTA update then > >>>>>>>>> the faulty component has enough to reinstiantiate so that amfnd > >>>>>>>>> sends > >>>>>>>>> su_oper_message (enabled) > >>>>>>>>> before quiesced assignment response, the reported problem will > be > >>>>>>>>> seen. > >>>>>>>>> > >>>>>>>>> This patch hardens the component failover sequence by ensuring > >> the > >>>>>>>>> su_oper_message (enabled) to > >>>>>>>>> be sent after su completes to remove assignment. This approach > >> comes > >>>>>>>>> from the similarity in > >>>>>>>>> su failover, where the su_oper_message (enabled) is sent in > repair > >>>>>>>>> phase. > >>>>>>>>> > >>>>>>>>> diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h > >>>>>>>>> --- a/src/amf/amfnd/avnd_su.h > >>>>>>>>> +++ b/src/amf/amfnd/avnd_su.h > >>>>>>>>> @@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis > >>>>>>>>> extern struct avnd_su_si_rec *avnd_silist_getprev(const > struct > >>>>>>>>> avnd_su_si_rec *); > >>>>>>>>> extern struct avnd_su_si_rec *avnd_silist_getlast(void); > >>>>>>>>> extern bool sufailover_in_progress(const AVND_SU *su); > >>>>>>>>> +extern bool componentfailover_in_progress(const AVND_SU *su); > >>>>>>>>> extern bool sufailover_during_nodeswitchover(const AVND_SU > >> *su); > >>>>>>>>> extern bool all_csis_in_removed_state(const AVND_SU *su); > >>>>>>>>> extern void su_reset_restart_count_in_comps(const struct > >>>>>>>>> avnd_cb_tag > >>>>>>>>> *cb, const AVND_SU *su); > >>>>>>>>> diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc > >>>>>>>>> --- a/src/amf/amfnd/clc.cc > >>>>>>>>> +++ b/src/amf/amfnd/clc.cc > >>>>>>>>> @@ -2381,9 +2381,6 @@ uint32_t > >> avnd_comp_clc_terming_cleansucc > >>>>>>>>> (m_AVND_SU_IS_FAILOVER(su))) { > >>>>>>>>> /* yes, request director to orchestrate component > >>>>>>>>> failover */ > >>>>>>>>> rc = avnd_di_oper_send(cb, su, > >> SA_AMF_COMPONENT_FAILOVER); > >>>>>>>>> - > >>>>>>>>> - //Reset component-failover here. SU failover is reset > as > >>>>>>>>> part > >>>>>>>>> of REPAIRED admin op. > >>>>>>>>> - m_AVND_SU_FAILOVER_RESET(su); > >>>>>>>>> } > >>>>>>>>> /* > >>>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc > >>>>>>>>> --- a/src/amf/amfnd/di.cc > >>>>>>>>> +++ b/src/amf/amfnd/di.cc > >>>>>>>>> @@ -894,7 +894,17 @@ uint32_t > >> avnd_di_susi_resp_send(AVND_CB > >>>>>>>>> } > >>>>>>>>> m_AVND_SU_ALL_SI_RESET(su); > >>>>>>>>> } > >>>>>>>>> - > >>>>>>>>> + if (componentfailover_in_progress(su)) { > >>>>>>>>> + if (all_csis_in_removed_state(su) == true) { > >>>>>>>>> + bool is_en; > >>>>>>>>> + m_AVND_SU_IS_ENABLED(su, is_en); > >>>>>>>>> + if (is_en) { > >>>>>>>>> + if (avnd_di_oper_send(cb, su, 0) == > >>>>>>>>> NCSCC_RC_SUCCESS) { > >>>>>>>>> + m_AVND_SU_FAILOVER_RESET(su); > >>>>>>>>> + } > >>>>>>>>> + } > >>>>>>>>> + } > >>>>>>>>> + } > >>>>>>>>> /* free the contents of avnd message */ > >>>>>>>>> avnd_msg_content_free(cb, &msg); > >>>>>>>>> diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc > >>>>>>>>> --- a/src/amf/amfnd/susm.cc > >>>>>>>>> +++ b/src/amf/amfnd/susm.cc > >>>>>>>>> @@ -1633,10 +1633,22 @@ uint32_t > >> avnd_su_pres_st_chng_prc(AVND_C > >>>>>>>>> m_AVND_SU_IS_ENABLED(su, is_en); > >>>>>>>>> if (true == is_en) { > >>>>>>>>> TRACE("SU oper state is enabled"); > >>>>>>>>> + // do not send su_oper state if component > >>>>>>>>> failover is > >>>>>>>>> in progress > >>>>>>>>> m_AVND_SU_OPER_STATE_SET(su, > >>>>>>>>> SA_AMF_OPERATIONAL_ENABLED); > >>>>>>>>> - rc = avnd_di_oper_send(cb, su, 0); > >>>>>>>>> - if (NCSCC_RC_SUCCESS != rc) > >>>>>>>>> - goto done; > >>>>>>>>> + if (componentfailover_in_progress(su) == > true) { > >>>>>>>>> + si = reinterpret_cast<AVND_SU_SI_REC*> > >>>>>>>>> + (m_NCS_DBLIST_FIND_FIRST(&su->si_list)); > >>>>>>>>> + if (si == nullptr || > >>>>>>>>> all_csis_in_removed_state(su)) { > >>>>>>>>> + rc = avnd_di_oper_send(cb, su, 0); > >>>>>>>>> + if (rc != NCSCC_RC_SUCCESS) > >>>>>>>>> + goto done; > >>>>>>>>> + m_AVND_SU_FAILOVER_RESET(su); > >>>>>>>>> + } > >>>>>>>>> + } else { > >>>>>>>>> + rc = avnd_di_oper_send(cb, su, 0); > >>>>>>>>> + if (NCSCC_RC_SUCCESS != rc) > >>>>>>>>> + goto done; > >>>>>>>>> + } > >>>>>>>>> } > >>>>>>>>> else > >>>>>>>>> TRACE("SU oper state is disabled"); > >>>>>>>>> @@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const > >> AVND_S > >>>>>>>>> } > >>>>>>>>> /** > >>>>>>>>> + * This function checks if the componentfailover is going on. > >>>>>>>>> + * @param su: ptr to the SU . > >>>>>>>>> + * > >>>>>>>>> + * @return true/false. > >>>>>>>>> + */ > >>>>>>>>> +bool componentfailover_in_progress(const AVND_SU *su) { > >>>>>>>>> + if ((su->sufailover == false) && > (!m_AVND_SU_IS_RESTART(su)) > >> && > >>>>>>>>> + (avnd_cb->oper_state != > >> SA_AMF_OPERATIONAL_DISABLED) && > >>>>>>>>> (!su->is_ncs) && > >>>>>>>>> + m_AVND_SU_IS_FAILOVER(su)) > >>>>>>>>> + return true; > >>>>>>>>> + return false; > >>>>>>>>> +} > >>>>>>>>> + > >>>>>>>>> +/** > >>>>>>>>> * This function checks if the sufailover and node > switchover are > >>>>>>>>> going on. > >>>>>>>>> * @param su: ptr to the SU . > >>>>>>>>> * > >>>>>>>>> > >>>>>>> > ------------------------------------------------------------------------------ > >>>>>>> > >>>>>>> > >>>>>>> Check out the vibrant tech community on one of the world's most > >>>>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot > >>>>>>> _______________________________________________ > >>>>>>> Opensaf-devel mailing list > >>>>>>> Opensaf-devel@lists.sourceforge.net > >>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel > >>>>>>> > >>>>> > > > > > ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel