Hi,

Thanks Gary.
@Nagu, Praveen: Have you had time to check the example in my previous email?
The ticket #2179 is about to document that full escalation is supported 
for SC absence feature, it is waiting for fix of #2233.
I think there's not big change in code for #2233, it's a matter of 
decision to make for re-instantiation of failed component.

Thanks,
Minh

On 01/03/17 15:42, Gary Lee wrote:
> Hi
>
> It seems the component should be re-instantiated if it has no CSI. Whether or 
> not there is an SI assigned should be irrelevant?
>
> Thanks
> Gary
>
> -----Original Message-----
> From: minh chau <minh.c...@dektech.com.au>
> Date: Thursday, 23 February 2017 at 3:16 pm
> To: Nagendra Kumar <nagendr...@oracle.com>, Praveen Malviya 
> <praveen.malv...@oracle.com>
> Cc: <hans.nordeb...@ericsson.com>, gary <gary....@dektech.com.au>, 
> <long.hb.ngu...@dektech.com.au>, <opensaf-devel@lists.sourceforge.net>
> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message 
> synchronizes with component failover sequence [#2233]
>
>      Hi Nagu, Praveen,
>      
>      Please find my comment in [Minh3]
>      
>      Thanks,
>      Minh
>      
>      On 22/02/17 19:34, Nagendra Kumar wrote:
>      >>> Since in spec there is no specific discussion for comp-failover 
> recovery for an unassigned comp, I will encourage other maintainers also to 
> provide inputs.
>      > I do agree for not instantiating failed component before recovery, 
> this keeps the approach similar to SU failover also.
>      [Minh3]: There's one example of component failover that I would like us
>      to have a look
>      - 2N application, SU4/SU5 has active/standby assignment respectively,
>      each SU has 3 components
>      - Add a sleep of 10 seconds in clc script start command of first
>      component C41 of SU4
>      Steps:
>      1- Kill C41 to trigger component failover
>      2- SU4 goes for quiesced assignment
>      3- SU5 goes for active assignment
>      4- SU4 is removed its assignment
>      5- Now there's a pause of 10 seconds due to clc script start, to ensure
>      that C41 is healthy
>      6- Next SU4 has standby assignment.
>      
>       From the above example, I think we can see some problems if the
>      re-instantiation of C41 is delayed:
>      - Because C41 is faulty, it needs to be restarted ok because its SU has
>      assignment
>      - Moving re-instantiation of C41 is further down that means the recovery
>      will take longer
>      - What if re-instantiation of C41 leads to instantation-failed
>      
>      Whether or not the C41 has assignment or is unassigned, the
>      OperState/PresenceState result from re-instantiation of faulty C41
>      affects to SU4's eligibility for assignment.
>      There's a parallelism between [restart of faulty component C41] and
>      [movement from Active->Quiesced->Removed assignment of SU4], it's good
>      to have and it's current behavior of amfnd.
>      It's my understanding so far but I could be wrong. Let's check with Hans
>      and Gary.
>      
>      >
>      > @Minh: If you don't mind, we can take su oper state changes in an 
> enhancement. What do you say ?
>      [Minh3]: Enhancement is ok, but I hope we can have this fix in 5.2 
> release.
>      >
>      > Thanks
>      > -Nagu
>      >
>      >> -----Original Message-----
>      >> From: praveen malviya
>      >> Sent: 21 February 2017 11:15
>      >> To: minh chau
>      >> Cc: hans.nordeb...@ericsson.com; Nagendra Kumar;
>      >> gary....@dektech.com.au; long.hb.ngu...@dektech.com.au; opensaf-
>      >> de...@lists.sourceforge.net
>      >> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational 
> message
>      >> synchronizes with component failover sequence [#2233]
>      >>
>      >> Hi Minh,
>      >>
>      >> Please find my response inline with [Praveen].
>      >>
>      >> Thanks,
>      >> Praveen
>      >>
>      >> On 20-Feb-17 6:58 AM, minh chau wrote:
>      >>> Hi Praveen,
>      >>>
>      >>> Thanks for your V2 patch, I have tested V2 in scenario of ticket 
> #2233
>      >>> and #1902, it also can fix the problem.
>      >>> Here we have 2 solutions:
>      >>> - The one I sent for review is letting the failed component to be
>      >>> instantiated, I think it is current behavior. But one change is that
>      >>> amfnd will not report su operational message to amfd until amfnd
>      >>> finishes removing the assignment of (faulty) su which contains the
>      >>> failed component
>      >>> - The V2 patch postpones the instantiation of failed component. amfnd
>      >>> will instantiate the failed component (via avnd_err_su_repair) after
>      >>> amfnd finishes removing the assignment of faulty su.
>      >>>
>      >>> So basically the difference is the time that the failed component 
> should
>      >>> be instantiated.
>      >>>
>      >>> Still in item 3.11.1.3.2:
>      >>> "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
>      >>> other-wise, the number of active service units would be higher than 
> what
>      >>> is allowed by the redundancy model. However, in an Nway redundancy
>      >>> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
>      >>> configuration attribute of the ser-vice unit is set to SA_FALSE), 
> and a
>      >>> repair of C2 should be attempted by reinstantiating it. If the 
> attempt
>      >>> to instantiate C2 fails, the service unit becomes disabled, and SI2 
> must
>      >>> be switched-over; however, if the attempt to instantiate C2 is
>      >>> successful, SI2 shall remain assigned to SU1, and based on other
>      >>> configuration parameters and N-way redundancy model semantics, even
>      >> SI1
>      >>> might get reassigned to SU1."
>      >>>
>      >>> My comment on V2:
>      >>>
>      >>> The configuration in #2233 is different from the example in
>      >>> specification, but it sounds to me the attempt to instantiate failed
>      >>> component should be done as soon as possible.
>      >>> The check in V2 patch means the failed component won't be 
> instantiated
>      >>> if its SU still has any assignment. It should be true to 2N and N+M, 
> but
>      >>> it's not for other SG. (As the example in specification, S2 does not
>      >>> have any CSI assigned to failed component C2).
>      >> [Praveen]As of now we have documented in the PR doc (conformance table
>      >> section 3.11.1.3 Recovery) that if a component faults with 
> comp-failover
>      >> recovery then AMFD switch-overs the whole SU for N-Way, N-Way Active
>      >> and
>      >> N+M models also. This is just to highlight about other red models. But
>      >> this documentation is not clear for an unassigned comp.
>      >> But from the beginning, comp-failover is working this way only. 
> At-least
>      >> from clean up perspective we have fixed the problem of parallelism in
>      >> the past in the ticket #474.
>      >>
>      >> One more thing I have noted, proxy-proxied implementation is based on
>      >> B.01.01. As per B.01.01, proxy will register himself and its proxied 
> as
>      >> soon as it gets instantiated. In a configuration containing both proxy
>      >> and proxied comp, if the proxy does not get any CSI and it faults with
>      >> comp-failover recovery then in instantiation phase it may again 
> register
>      >> its proxy. I think proxy in other SU should register its proxied. I
>      >> guess, from deployment perspective such a configuration in which a 
> user
>      >> configures proxy without any CSI may not exists and only possibility 
> is
>      >> an application modeling a legacy code in NoRed model. However, in the
>      >> later version of spec B.01.02, proxied was supposed to mention the 
> name
>      >> of proxy CSI and thus proxy should register only when its get proxy 
> CSI.
>      >>
>      >> One more point to be noted comp-failover can also be done as a part of
>      >> escalation also. If a component is instantiated before the completion 
> of
>      >> comp-failover recovery and if faults again then it may escalate to
>      >> node-failover before completion of comp-failover recovery.
>      >>
>      >> Since in spec there is no specific discussion for comp-failover 
> recovery
>      >> for an unassigned comp, I will encourage other maintainers also to
>      >> provide inputs.
>      >>
>      >>
>      >> Thanks,
>      >> Praveen
>      >>
>      >>
>      >>    Moreover, in the clc.cc,
>      >>> amfnd does not check any of si_list.n_nodes, this probably is the 
> logic
>      >>> that has being done so far.
>      >>>
>      >>> Thanks,
>      >>> Minh
>      >>>
>      >>> On 17/02/17 23:16, praveen malviya wrote:
>      >>>> Hi Minh,
>      >>>>
>      >>>> I think we should see this problem from fault management perspective
>      >>>> also. Here repair of failed component is performed before the
>      >>>> completion of recovery.In the problem, component faulted with
>      >>>> comp-failover recovery and it was successfully 
> repaired(instantiated)
>      >>>> when SU switch-over was still pending.
>      >>>>
>      >>>> Now the question is: Why it was never observed earlier? The reason 
> is
>      >>>> generally all components are assigned at least one CSI. In the 
> present
>      >>>> configuration failed component was not assigned any CSI. When this
>      >>>> component was cleaned up and marked UNINSTANTIATED, AMFND sent
>      >>>> comp-failover recovery request to AMFD. But after sending recovery
>      >>>> request, it instantiated failed comp when SU has still assignments 
> to
>      >>>> be switch-overed. The code related to this assumes that comp will 
> have
>      >>>> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(),
>      >>>> TERMINATING to UNINSTANTIATED if block). For normal sequence of
>      >>>> comp-failover, su is repaired after removal of assignment in
>      >>>> avnd_su_si_oper_done() by calling avnd_err_su_repair().
>      >>>>
>      >>>> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page
>      >>>> 195) about switch-overing all the SIs of failed SU in case of
>      >>>> comp-failed recovery and not for other models. In current OpenSAF
>      >>>> implementation we are following this for all models.
>      >>>>
>      >>>> I think as a fix we should stop failed comp to get instantiated 
> before
>      >>>> removal of assignments. For this the check in clc.cc can be hardened
>      >>>> to consider non-assigned comp failures.
>      >>>> Attached is the patch (2233_v2.patch) based on this idea/approach.
>      >>>>
>      >>>> Thanks,
>      >>>> Praveen
>      >>>>
>      >>>>
>      >>>> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote:
>      >>>>> Hi Praveen,
>      >>>>>
>      >>>>> Yes, you are right, I will update the description.
>      >>>>>
>      >>>>> Thanks, Minh
>      >>>>>
>      >>>>> Quoting praveen malviya <praveen.malv...@oracle.com>:
>      >>>>>
>      >>>>>> Hi Minh,
>      >>>>>>
>      >>>>>> One quick question:
>      >>>>>> Ticket description says:
>      >>>>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1
>      >> depends
>      >>>>>> safSi=AmfDemoTwon"
>      >>>>>> But logs are related to without SIdep. Also in the configuration
>      >>>>>> app3_twon3su3si.xml, SI dep classes are commented.
>      >>>>>> I think ticket description needs correction as problem is without 
> SI
>      >>>>>> dep.
>      >>>>>> Please confirm.
>      >>>>>>
>      >>>>>> Thanks,
>      >>>>>> Praveen
>      >>>>>>
>      >>>>>>
>      >>>>>> On 17-Feb-17 10:58 AM, praveen malviya wrote:
>      >>>>>>> Hi Minh,
>      >>>>>>>
>      >>>>>>> I have started reviewing this patch.
>      >>>>>>>
>      >>>>>>> Thanks,
>      >>>>>>> Praveen
>      >>>>>>>
>      >>>>>>> On 15-Feb-17 9:22 AM, minh chau wrote:
>      >>>>>>>> Hi all,
>      >>>>>>>>
>      >>>>>>>> Have you had time to review this patch?
>      >>>>>>>> It changes the component failover sequence, so I think we need
>      >> more
>      >>>>>>>> time
>      >>>>>>>> to look at it.
>      >>>>>>>>
>      >>>>>>>> Thanks,
>      >>>>>>>> Minh
>      >>>>>>>>
>      >>>>>>>> On 23/01/17 12:28, Minh Hon Chau wrote:
>      >>>>>>>>>   src/amf/amfnd/avnd_su.h |   1 +
>      >>>>>>>>>   src/amf/amfnd/clc.cc    |   3 ---
>      >>>>>>>>>   src/amf/amfnd/di.cc     |  12 +++++++++++-
>      >>>>>>>>>   src/amf/amfnd/susm.cc   |  32
>      >> +++++++++++++++++++++++++++++---
>      >>>>>>>>>   4 files changed, 41 insertions(+), 7 deletions(-)
>      >>>>>>>>>
>      >>>>>>>>>
>      >>>>>>>>> In case component failover, faulty component will be 
> terminated.
>      >>>>>>>>> When
>      >>>>>>>>> the reinstantiation
>      >>>>>>>>> is done, amfnd will send su_oper_message (enabled) to amfd 
> which
>      >> is
>      >>>>>>>>> running along with
>      >>>>>>>>> component failover. In the reported problem, if su_oper_message
>      >>>>>>>>> (enabled) comes to amfd
>      >>>>>>>>> before the quiesced assignment response (as part of component
>      >>>>>>>>> failover
>      >>>>>>>>> sequence) comes to
>      >>>>>>>>> amfd, then this quiesced assignment response is ignored, thus
>      >>>>>>>>> component failover will not
>      >>>>>>>>> finish.
>      >>>>>>>>>
>      >>>>>>>>> The problem is in function susi_success_sg_realign with act=5,
>      >>>>>>>>> state=3, amfd always assumes
>      >>>>>>>>> su having faulty component is OUT_OF_SERVICE. This assumption 
> is
>      >>>>>>>>> true
>      >>>>>>>>> in most of the time
>      >>>>>>>>> when su_oper_message (enabled) comes a little later than 
> quiesced
>      >>>>>>>>> assignment response. In fact
>      >>>>>>>>> the su_oper_message (enabled) is not designed as part of
>      >> component
>      >>>>>>>>> failover sequence, thus it
>      >>>>>>>>> can come any time during the failover. If amfd is getting a bit
>      >>>>>>>>> busier
>      >>>>>>>>> with RTA update then
>      >>>>>>>>> the faulty component has enough to reinstiantiate so that amfnd
>      >>>>>>>>> sends
>      >>>>>>>>> su_oper_message (enabled)
>      >>>>>>>>> before quiesced assignment response, the reported problem will 
> be
>      >>>>>>>>> seen.
>      >>>>>>>>>
>      >>>>>>>>> This patch hardens the component failover sequence by ensuring
>      >> the
>      >>>>>>>>> su_oper_message (enabled) to
>      >>>>>>>>> be sent after su completes to remove assignment. This approach
>      >> comes
>      >>>>>>>>> from the similarity in
>      >>>>>>>>> su failover, where the su_oper_message (enabled) is sent in 
> repair
>      >>>>>>>>> phase.
>      >>>>>>>>>
>      >>>>>>>>> diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h
>      >>>>>>>>> --- a/src/amf/amfnd/avnd_su.h
>      >>>>>>>>> +++ b/src/amf/amfnd/avnd_su.h
>      >>>>>>>>> @@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis
>      >>>>>>>>>   extern struct avnd_su_si_rec *avnd_silist_getprev(const 
> struct
>      >>>>>>>>> avnd_su_si_rec *);
>      >>>>>>>>>   extern struct avnd_su_si_rec *avnd_silist_getlast(void);
>      >>>>>>>>>   extern bool sufailover_in_progress(const AVND_SU *su);
>      >>>>>>>>> +extern bool componentfailover_in_progress(const AVND_SU *su);
>      >>>>>>>>>   extern bool sufailover_during_nodeswitchover(const AVND_SU
>      >> *su);
>      >>>>>>>>>   extern bool all_csis_in_removed_state(const AVND_SU *su);
>      >>>>>>>>>   extern void su_reset_restart_count_in_comps(const struct
>      >>>>>>>>> avnd_cb_tag
>      >>>>>>>>> *cb, const AVND_SU *su);
>      >>>>>>>>> diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
>      >>>>>>>>> --- a/src/amf/amfnd/clc.cc
>      >>>>>>>>> +++ b/src/amf/amfnd/clc.cc
>      >>>>>>>>> @@ -2381,9 +2381,6 @@ uint32_t
>      >> avnd_comp_clc_terming_cleansucc
>      >>>>>>>>>               (m_AVND_SU_IS_FAILOVER(su))) {
>      >>>>>>>>>           /* yes, request director to orchestrate component
>      >>>>>>>>> failover */
>      >>>>>>>>>           rc = avnd_di_oper_send(cb, su,
>      >> SA_AMF_COMPONENT_FAILOVER);
>      >>>>>>>>> -
>      >>>>>>>>> -        //Reset component-failover here. SU failover is reset 
> as
>      >>>>>>>>> part
>      >>>>>>>>> of REPAIRED admin op.
>      >>>>>>>>> -        m_AVND_SU_FAILOVER_RESET(su);
>      >>>>>>>>>       }
>      >>>>>>>>>         /*
>      >>>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>      >>>>>>>>> --- a/src/amf/amfnd/di.cc
>      >>>>>>>>> +++ b/src/amf/amfnd/di.cc
>      >>>>>>>>> @@ -894,7 +894,17 @@ uint32_t
>      >> avnd_di_susi_resp_send(AVND_CB
>      >>>>>>>>>               }
>      >>>>>>>>>               m_AVND_SU_ALL_SI_RESET(su);
>      >>>>>>>>>           }
>      >>>>>>>>> -
>      >>>>>>>>> +        if (componentfailover_in_progress(su)) {
>      >>>>>>>>> +            if (all_csis_in_removed_state(su) == true) {
>      >>>>>>>>> +                bool is_en;
>      >>>>>>>>> +                m_AVND_SU_IS_ENABLED(su, is_en);
>      >>>>>>>>> +                if (is_en) {
>      >>>>>>>>> +                    if (avnd_di_oper_send(cb, su, 0) ==
>      >>>>>>>>> NCSCC_RC_SUCCESS) {
>      >>>>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
>      >>>>>>>>> +                    }
>      >>>>>>>>> +                }
>      >>>>>>>>> +            }
>      >>>>>>>>> +        }
>      >>>>>>>>>       /* free the contents of avnd message */
>      >>>>>>>>>       avnd_msg_content_free(cb, &msg);
>      >>>>>>>>>   diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc
>      >>>>>>>>> --- a/src/amf/amfnd/susm.cc
>      >>>>>>>>> +++ b/src/amf/amfnd/susm.cc
>      >>>>>>>>> @@ -1633,10 +1633,22 @@ uint32_t
>      >> avnd_su_pres_st_chng_prc(AVND_C
>      >>>>>>>>>               m_AVND_SU_IS_ENABLED(su, is_en);
>      >>>>>>>>>               if (true == is_en) {
>      >>>>>>>>>                   TRACE("SU oper state is enabled");
>      >>>>>>>>> +                // do not send su_oper state if component
>      >>>>>>>>> failover is
>      >>>>>>>>> in progress
>      >>>>>>>>>                   m_AVND_SU_OPER_STATE_SET(su,
>      >>>>>>>>> SA_AMF_OPERATIONAL_ENABLED);
>      >>>>>>>>> -                rc = avnd_di_oper_send(cb, su, 0);
>      >>>>>>>>> -                if (NCSCC_RC_SUCCESS != rc)
>      >>>>>>>>> -                    goto done;
>      >>>>>>>>> +                if (componentfailover_in_progress(su) == 
> true) {
>      >>>>>>>>> +                    si = reinterpret_cast<AVND_SU_SI_REC*>
>      >>>>>>>>> + (m_NCS_DBLIST_FIND_FIRST(&su->si_list));
>      >>>>>>>>> +                    if (si == nullptr ||
>      >>>>>>>>> all_csis_in_removed_state(su)) {
>      >>>>>>>>> +                        rc = avnd_di_oper_send(cb, su, 0);
>      >>>>>>>>> +                        if (rc != NCSCC_RC_SUCCESS)
>      >>>>>>>>> +                            goto done;
>      >>>>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
>      >>>>>>>>> +                    }
>      >>>>>>>>> +                } else {
>      >>>>>>>>> +                    rc = avnd_di_oper_send(cb, su, 0);
>      >>>>>>>>> +                    if (NCSCC_RC_SUCCESS != rc)
>      >>>>>>>>> +                        goto done;
>      >>>>>>>>> +                }
>      >>>>>>>>>               }
>      >>>>>>>>>               else
>      >>>>>>>>>                   TRACE("SU oper state is disabled");
>      >>>>>>>>> @@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const
>      >> AVND_S
>      >>>>>>>>>   }
>      >>>>>>>>>     /**
>      >>>>>>>>> + * This function checks if the componentfailover is going on.
>      >>>>>>>>> + * @param su: ptr to the SU .
>      >>>>>>>>> + *
>      >>>>>>>>> + * @return true/false.
>      >>>>>>>>> + */
>      >>>>>>>>> +bool componentfailover_in_progress(const AVND_SU *su) {
>      >>>>>>>>> +    if ((su->sufailover == false) && 
> (!m_AVND_SU_IS_RESTART(su))
>      >> &&
>      >>>>>>>>> +            (avnd_cb->oper_state !=
>      >> SA_AMF_OPERATIONAL_DISABLED) &&
>      >>>>>>>>> (!su->is_ncs) &&
>      >>>>>>>>> +            m_AVND_SU_IS_FAILOVER(su))
>      >>>>>>>>> +        return true;
>      >>>>>>>>> +    return false;
>      >>>>>>>>> +}
>      >>>>>>>>> +
>      >>>>>>>>> +/**
>      >>>>>>>>>    * This function checks if the sufailover and node 
> switchover are
>      >>>>>>>>> going on.
>      >>>>>>>>>    * @param su: ptr to the SU .
>      >>>>>>>>>    *
>      >>>>>>>>>
>      >>>>>>> 
> ------------------------------------------------------------------------------
>      >>>>>>>
>      >>>>>>>
>      >>>>>>> Check out the vibrant tech community on one of the world's most
>      >>>>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>      >>>>>>> _______________________________________________
>      >>>>>>> Opensaf-devel mailing list
>      >>>>>>> Opensaf-devel@lists.sourceforge.net
>      >>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>      >>>>>>>
>      >>>>>
>      
>      
>
>
>


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to