Hi It seems the component should be re-instantiated if it has no CSI. Whether or not there is an SI assigned should be irrelevant?
Thanks Gary -----Original Message----- From: minh chau <[email protected]> Date: Thursday, 23 February 2017 at 3:16 pm To: Nagendra Kumar <[email protected]>, Praveen Malviya <[email protected]> Cc: <[email protected]>, gary <[email protected]>, <[email protected]>, <[email protected]> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233] Hi Nagu, Praveen, Please find my comment in [Minh3] Thanks, Minh On 22/02/17 19:34, Nagendra Kumar wrote: >>> Since in spec there is no specific discussion for comp-failover recovery for an unassigned comp, I will encourage other maintainers also to provide inputs. > I do agree for not instantiating failed component before recovery, this keeps the approach similar to SU failover also. [Minh3]: There's one example of component failover that I would like us to have a look - 2N application, SU4/SU5 has active/standby assignment respectively, each SU has 3 components - Add a sleep of 10 seconds in clc script start command of first component C41 of SU4 Steps: 1- Kill C41 to trigger component failover 2- SU4 goes for quiesced assignment 3- SU5 goes for active assignment 4- SU4 is removed its assignment 5- Now there's a pause of 10 seconds due to clc script start, to ensure that C41 is healthy 6- Next SU4 has standby assignment. From the above example, I think we can see some problems if the re-instantiation of C41 is delayed: - Because C41 is faulty, it needs to be restarted ok because its SU has assignment - Moving re-instantiation of C41 is further down that means the recovery will take longer - What if re-instantiation of C41 leads to instantation-failed Whether or not the C41 has assignment or is unassigned, the OperState/PresenceState result from re-instantiation of faulty C41 affects to SU4's eligibility for assignment. There's a parallelism between [restart of faulty component C41] and [movement from Active->Quiesced->Removed assignment of SU4], it's good to have and it's current behavior of amfnd. It's my understanding so far but I could be wrong. Let's check with Hans and Gary. > > @Minh: If you don't mind, we can take su oper state changes in an enhancement. What do you say ? [Minh3]: Enhancement is ok, but I hope we can have this fix in 5.2 release. > > Thanks > -Nagu > >> -----Original Message----- >> From: praveen malviya >> Sent: 21 February 2017 11:15 >> To: minh chau >> Cc: [email protected]; Nagendra Kumar; >> [email protected]; [email protected]; opensaf- >> [email protected] >> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message >> synchronizes with component failover sequence [#2233] >> >> Hi Minh, >> >> Please find my response inline with [Praveen]. >> >> Thanks, >> Praveen >> >> On 20-Feb-17 6:58 AM, minh chau wrote: >>> Hi Praveen, >>> >>> Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233 >>> and #1902, it also can fix the problem. >>> Here we have 2 solutions: >>> - The one I sent for review is letting the failed component to be >>> instantiated, I think it is current behavior. But one change is that >>> amfnd will not report su operational message to amfd until amfnd >>> finishes removing the assignment of (faulty) su which contains the >>> failed component >>> - The V2 patch postpones the instantiation of failed component. amfnd >>> will instantiate the failed component (via avnd_err_su_repair) after >>> amfnd finishes removing the assignment of faulty su. >>> >>> So basically the difference is the time that the failed component should >>> be instantiated. >>> >>> Still in item 3.11.1.3.2: >>> "In a 2N or N+M redundancy model, SI2 also needs to be switched over; >>> other-wise, the number of active service units would be higher than what >>> is allowed by the redundancy model. However, in an Nway redundancy >>> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover >>> configuration attribute of the ser-vice unit is set to SA_FALSE), and a >>> repair of C2 should be attempted by reinstantiating it. If the attempt >>> to instantiate C2 fails, the service unit becomes disabled, and SI2 must >>> be switched-over; however, if the attempt to instantiate C2 is >>> successful, SI2 shall remain assigned to SU1, and based on other >>> configuration parameters and N-way redundancy model semantics, even >> SI1 >>> might get reassigned to SU1." >>> >>> My comment on V2: >>> >>> The configuration in #2233 is different from the example in >>> specification, but it sounds to me the attempt to instantiate failed >>> component should be done as soon as possible. >>> The check in V2 patch means the failed component won't be instantiated >>> if its SU still has any assignment. It should be true to 2N and N+M, but >>> it's not for other SG. (As the example in specification, S2 does not >>> have any CSI assigned to failed component C2). >> [Praveen]As of now we have documented in the PR doc (conformance table >> section 3.11.1.3 Recovery) that if a component faults with comp-failover >> recovery then AMFD switch-overs the whole SU for N-Way, N-Way Active >> and >> N+M models also. This is just to highlight about other red models. But >> this documentation is not clear for an unassigned comp. >> But from the beginning, comp-failover is working this way only. At-least >> from clean up perspective we have fixed the problem of parallelism in >> the past in the ticket #474. >> >> One more thing I have noted, proxy-proxied implementation is based on >> B.01.01. As per B.01.01, proxy will register himself and its proxied as >> soon as it gets instantiated. In a configuration containing both proxy >> and proxied comp, if the proxy does not get any CSI and it faults with >> comp-failover recovery then in instantiation phase it may again register >> its proxy. I think proxy in other SU should register its proxied. I >> guess, from deployment perspective such a configuration in which a user >> configures proxy without any CSI may not exists and only possibility is >> an application modeling a legacy code in NoRed model. However, in the >> later version of spec B.01.02, proxied was supposed to mention the name >> of proxy CSI and thus proxy should register only when its get proxy CSI. >> >> One more point to be noted comp-failover can also be done as a part of >> escalation also. If a component is instantiated before the completion of >> comp-failover recovery and if faults again then it may escalate to >> node-failover before completion of comp-failover recovery. >> >> Since in spec there is no specific discussion for comp-failover recovery >> for an unassigned comp, I will encourage other maintainers also to >> provide inputs. >> >> >> Thanks, >> Praveen >> >> >> Moreover, in the clc.cc, >>> amfnd does not check any of si_list.n_nodes, this probably is the logic >>> that has being done so far. >>> >>> Thanks, >>> Minh >>> >>> On 17/02/17 23:16, praveen malviya wrote: >>>> Hi Minh, >>>> >>>> I think we should see this problem from fault management perspective >>>> also. Here repair of failed component is performed before the >>>> completion of recovery.In the problem, component faulted with >>>> comp-failover recovery and it was successfully repaired(instantiated) >>>> when SU switch-over was still pending. >>>> >>>> Now the question is: Why it was never observed earlier? The reason is >>>> generally all components are assigned at least one CSI. In the present >>>> configuration failed component was not assigned any CSI. When this >>>> component was cleaned up and marked UNINSTANTIATED, AMFND sent >>>> comp-failover recovery request to AMFD. But after sending recovery >>>> request, it instantiated failed comp when SU has still assignments to >>>> be switch-overed. The code related to this assumes that comp will have >>>> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(), >>>> TERMINATING to UNINSTANTIATED if block). For normal sequence of >>>> comp-failover, su is repaired after removal of assignment in >>>> avnd_su_si_oper_done() by calling avnd_err_su_repair(). >>>> >>>> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page >>>> 195) about switch-overing all the SIs of failed SU in case of >>>> comp-failed recovery and not for other models. In current OpenSAF >>>> implementation we are following this for all models. >>>> >>>> I think as a fix we should stop failed comp to get instantiated before >>>> removal of assignments. For this the check in clc.cc can be hardened >>>> to consider non-assigned comp failures. >>>> Attached is the patch (2233_v2.patch) based on this idea/approach. >>>> >>>> Thanks, >>>> Praveen >>>> >>>> >>>> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote: >>>>> Hi Praveen, >>>>> >>>>> Yes, you are right, I will update the description. >>>>> >>>>> Thanks, Minh >>>>> >>>>> Quoting praveen malviya <[email protected]>: >>>>> >>>>>> Hi Minh, >>>>>> >>>>>> One quick question: >>>>>> Ticket description says: >>>>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 >> depends >>>>>> safSi=AmfDemoTwon" >>>>>> But logs are related to without SIdep. Also in the configuration >>>>>> app3_twon3su3si.xml, SI dep classes are commented. >>>>>> I think ticket description needs correction as problem is without SI >>>>>> dep. >>>>>> Please confirm. >>>>>> >>>>>> Thanks, >>>>>> Praveen >>>>>> >>>>>> >>>>>> On 17-Feb-17 10:58 AM, praveen malviya wrote: >>>>>>> Hi Minh, >>>>>>> >>>>>>> I have started reviewing this patch. >>>>>>> >>>>>>> Thanks, >>>>>>> Praveen >>>>>>> >>>>>>> On 15-Feb-17 9:22 AM, minh chau wrote: >>>>>>>> Hi all, >>>>>>>> >>>>>>>> Have you had time to review this patch? >>>>>>>> It changes the component failover sequence, so I think we need >> more >>>>>>>> time >>>>>>>> to look at it. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Minh >>>>>>>> >>>>>>>> On 23/01/17 12:28, Minh Hon Chau wrote: >>>>>>>>> src/amf/amfnd/avnd_su.h | 1 + >>>>>>>>> src/amf/amfnd/clc.cc | 3 --- >>>>>>>>> src/amf/amfnd/di.cc | 12 +++++++++++- >>>>>>>>> src/amf/amfnd/susm.cc | 32 >> +++++++++++++++++++++++++++++--- >>>>>>>>> 4 files changed, 41 insertions(+), 7 deletions(-) >>>>>>>>> >>>>>>>>> >>>>>>>>> In case component failover, faulty component will be terminated. >>>>>>>>> When >>>>>>>>> the reinstantiation >>>>>>>>> is done, amfnd will send su_oper_message (enabled) to amfd which >> is >>>>>>>>> running along with >>>>>>>>> component failover. In the reported problem, if su_oper_message >>>>>>>>> (enabled) comes to amfd >>>>>>>>> before the quiesced assignment response (as part of component >>>>>>>>> failover >>>>>>>>> sequence) comes to >>>>>>>>> amfd, then this quiesced assignment response is ignored, thus >>>>>>>>> component failover will not >>>>>>>>> finish. >>>>>>>>> >>>>>>>>> The problem is in function susi_success_sg_realign with act=5, >>>>>>>>> state=3, amfd always assumes >>>>>>>>> su having faulty component is OUT_OF_SERVICE. This assumption is >>>>>>>>> true >>>>>>>>> in most of the time >>>>>>>>> when su_oper_message (enabled) comes a little later than quiesced >>>>>>>>> assignment response. In fact >>>>>>>>> the su_oper_message (enabled) is not designed as part of >> component >>>>>>>>> failover sequence, thus it >>>>>>>>> can come any time during the failover. If amfd is getting a bit >>>>>>>>> busier >>>>>>>>> with RTA update then >>>>>>>>> the faulty component has enough to reinstiantiate so that amfnd >>>>>>>>> sends >>>>>>>>> su_oper_message (enabled) >>>>>>>>> before quiesced assignment response, the reported problem will be >>>>>>>>> seen. >>>>>>>>> >>>>>>>>> This patch hardens the component failover sequence by ensuring >> the >>>>>>>>> su_oper_message (enabled) to >>>>>>>>> be sent after su completes to remove assignment. This approach >> comes >>>>>>>>> from the similarity in >>>>>>>>> su failover, where the su_oper_message (enabled) is sent in repair >>>>>>>>> phase. >>>>>>>>> >>>>>>>>> diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h >>>>>>>>> --- a/src/amf/amfnd/avnd_su.h >>>>>>>>> +++ b/src/amf/amfnd/avnd_su.h >>>>>>>>> @@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis >>>>>>>>> extern struct avnd_su_si_rec *avnd_silist_getprev(const struct >>>>>>>>> avnd_su_si_rec *); >>>>>>>>> extern struct avnd_su_si_rec *avnd_silist_getlast(void); >>>>>>>>> extern bool sufailover_in_progress(const AVND_SU *su); >>>>>>>>> +extern bool componentfailover_in_progress(const AVND_SU *su); >>>>>>>>> extern bool sufailover_during_nodeswitchover(const AVND_SU >> *su); >>>>>>>>> extern bool all_csis_in_removed_state(const AVND_SU *su); >>>>>>>>> extern void su_reset_restart_count_in_comps(const struct >>>>>>>>> avnd_cb_tag >>>>>>>>> *cb, const AVND_SU *su); >>>>>>>>> diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc >>>>>>>>> --- a/src/amf/amfnd/clc.cc >>>>>>>>> +++ b/src/amf/amfnd/clc.cc >>>>>>>>> @@ -2381,9 +2381,6 @@ uint32_t >> avnd_comp_clc_terming_cleansucc >>>>>>>>> (m_AVND_SU_IS_FAILOVER(su))) { >>>>>>>>> /* yes, request director to orchestrate component >>>>>>>>> failover */ >>>>>>>>> rc = avnd_di_oper_send(cb, su, >> SA_AMF_COMPONENT_FAILOVER); >>>>>>>>> - >>>>>>>>> - //Reset component-failover here. SU failover is reset as >>>>>>>>> part >>>>>>>>> of REPAIRED admin op. >>>>>>>>> - m_AVND_SU_FAILOVER_RESET(su); >>>>>>>>> } >>>>>>>>> /* >>>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc >>>>>>>>> --- a/src/amf/amfnd/di.cc >>>>>>>>> +++ b/src/amf/amfnd/di.cc >>>>>>>>> @@ -894,7 +894,17 @@ uint32_t >> avnd_di_susi_resp_send(AVND_CB >>>>>>>>> } >>>>>>>>> m_AVND_SU_ALL_SI_RESET(su); >>>>>>>>> } >>>>>>>>> - >>>>>>>>> + if (componentfailover_in_progress(su)) { >>>>>>>>> + if (all_csis_in_removed_state(su) == true) { >>>>>>>>> + bool is_en; >>>>>>>>> + m_AVND_SU_IS_ENABLED(su, is_en); >>>>>>>>> + if (is_en) { >>>>>>>>> + if (avnd_di_oper_send(cb, su, 0) == >>>>>>>>> NCSCC_RC_SUCCESS) { >>>>>>>>> + m_AVND_SU_FAILOVER_RESET(su); >>>>>>>>> + } >>>>>>>>> + } >>>>>>>>> + } >>>>>>>>> + } >>>>>>>>> /* free the contents of avnd message */ >>>>>>>>> avnd_msg_content_free(cb, &msg); >>>>>>>>> diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc >>>>>>>>> --- a/src/amf/amfnd/susm.cc >>>>>>>>> +++ b/src/amf/amfnd/susm.cc >>>>>>>>> @@ -1633,10 +1633,22 @@ uint32_t >> avnd_su_pres_st_chng_prc(AVND_C >>>>>>>>> m_AVND_SU_IS_ENABLED(su, is_en); >>>>>>>>> if (true == is_en) { >>>>>>>>> TRACE("SU oper state is enabled"); >>>>>>>>> + // do not send su_oper state if component >>>>>>>>> failover is >>>>>>>>> in progress >>>>>>>>> m_AVND_SU_OPER_STATE_SET(su, >>>>>>>>> SA_AMF_OPERATIONAL_ENABLED); >>>>>>>>> - rc = avnd_di_oper_send(cb, su, 0); >>>>>>>>> - if (NCSCC_RC_SUCCESS != rc) >>>>>>>>> - goto done; >>>>>>>>> + if (componentfailover_in_progress(su) == true) { >>>>>>>>> + si = reinterpret_cast<AVND_SU_SI_REC*> >>>>>>>>> + (m_NCS_DBLIST_FIND_FIRST(&su->si_list)); >>>>>>>>> + if (si == nullptr || >>>>>>>>> all_csis_in_removed_state(su)) { >>>>>>>>> + rc = avnd_di_oper_send(cb, su, 0); >>>>>>>>> + if (rc != NCSCC_RC_SUCCESS) >>>>>>>>> + goto done; >>>>>>>>> + m_AVND_SU_FAILOVER_RESET(su); >>>>>>>>> + } >>>>>>>>> + } else { >>>>>>>>> + rc = avnd_di_oper_send(cb, su, 0); >>>>>>>>> + if (NCSCC_RC_SUCCESS != rc) >>>>>>>>> + goto done; >>>>>>>>> + } >>>>>>>>> } >>>>>>>>> else >>>>>>>>> TRACE("SU oper state is disabled"); >>>>>>>>> @@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const >> AVND_S >>>>>>>>> } >>>>>>>>> /** >>>>>>>>> + * This function checks if the componentfailover is going on. >>>>>>>>> + * @param su: ptr to the SU . >>>>>>>>> + * >>>>>>>>> + * @return true/false. >>>>>>>>> + */ >>>>>>>>> +bool componentfailover_in_progress(const AVND_SU *su) { >>>>>>>>> + if ((su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su)) >> && >>>>>>>>> + (avnd_cb->oper_state != >> SA_AMF_OPERATIONAL_DISABLED) && >>>>>>>>> (!su->is_ncs) && >>>>>>>>> + m_AVND_SU_IS_FAILOVER(su)) >>>>>>>>> + return true; >>>>>>>>> + return false; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> +/** >>>>>>>>> * This function checks if the sufailover and node switchover are >>>>>>>>> going on. >>>>>>>>> * @param su: ptr to the SU . >>>>>>>>> * >>>>>>>>> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> >>>>>>> >>>>>>> Check out the vibrant tech community on one of the world's most >>>>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot >>>>>>> _______________________________________________ >>>>>>> Opensaf-devel mailing list >>>>>>> [email protected] >>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel >>>>>>> >>>>> ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
