Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

Gary Lee Tue, 28 Feb 2017 20:43:36 -0800

Hi

It seems the component should be re-instantiated if it has no CSI. Whether or 
not there is an SI assigned should be irrelevant?


Thanks
Gary

-----Original Message-----
From: minh chau <[email protected]>
Date: Thursday, 23 February 2017 at 3:16 pm
To: Nagendra Kumar <[email protected]>, Praveen Malviya 
<[email protected]>
Cc: <[email protected]>, gary <[email protected]>, 
<[email protected]>, <[email protected]>
Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message 
synchronizes with component failover sequence [#2233]

    Hi Nagu, Praveen,
    
    Please find my comment in [Minh3]
    
    Thanks,
    Minh
    
    On 22/02/17 19:34, Nagendra Kumar wrote:
    >>> Since in spec there is no specific discussion for comp-failover 
recovery for an unassigned comp, I will encourage other maintainers also to 
provide inputs.
    > I do agree for not instantiating failed component before recovery, this 
keeps the approach similar to SU failover also.
    [Minh3]: There's one example of component failover that I would like us 
    to have a look
    - 2N application, SU4/SU5 has active/standby assignment respectively, 
    each SU has 3 components
    - Add a sleep of 10 seconds in clc script start command of first 
    component C41 of SU4
    Steps:
    1- Kill C41 to trigger component failover
    2- SU4 goes for quiesced assignment
    3- SU5 goes for active assignment
    4- SU4 is removed its assignment
    5- Now there's a pause of 10 seconds due to clc script start, to ensure 
    that C41 is healthy
    6- Next SU4 has standby assignment.
    
     From the above example, I think we can see some problems if the 
    re-instantiation of C41 is delayed:
    - Because C41 is faulty, it needs to be restarted ok because its SU has 
    assignment
    - Moving re-instantiation of C41 is further down that means the recovery 
    will take longer
    - What if re-instantiation of C41 leads to instantation-failed
    
    Whether or not the C41 has assignment or is unassigned, the 
    OperState/PresenceState result from re-instantiation of faulty C41 
    affects to SU4's eligibility for assignment.
    There's a parallelism between [restart of faulty component C41] and 
    [movement from Active->Quiesced->Removed assignment of SU4], it's good 
    to have and it's current behavior of amfnd.
    It's my understanding so far but I could be wrong. Let's check with Hans 
    and Gary.
    
    >
    > @Minh: If you don't mind, we can take su oper state changes in an 
enhancement. What do you say ?
    [Minh3]: Enhancement is ok, but I hope we can have this fix in 5.2 release.
    >
    > Thanks
    > -Nagu
    >
    >> -----Original Message-----
    >> From: praveen malviya
    >> Sent: 21 February 2017 11:15
    >> To: minh chau
    >> Cc: [email protected]; Nagendra Kumar;
    >> [email protected]; [email protected]; opensaf-
    >> [email protected]
    >> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message
    >> synchronizes with component failover sequence [#2233]
    >>
    >> Hi Minh,
    >>
    >> Please find my response inline with [Praveen].
    >>
    >> Thanks,
    >> Praveen
    >>
    >> On 20-Feb-17 6:58 AM, minh chau wrote:
    >>> Hi Praveen,
    >>>
    >>> Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233
    >>> and #1902, it also can fix the problem.
    >>> Here we have 2 solutions:
    >>> - The one I sent for review is letting the failed component to be
    >>> instantiated, I think it is current behavior. But one change is that
    >>> amfnd will not report su operational message to amfd until amfnd
    >>> finishes removing the assignment of (faulty) su which contains the
    >>> failed component
    >>> - The V2 patch postpones the instantiation of failed component. amfnd
    >>> will instantiate the failed component (via avnd_err_su_repair) after
    >>> amfnd finishes removing the assignment of faulty su.
    >>>
    >>> So basically the difference is the time that the failed component should
    >>> be instantiated.
    >>>
    >>> Still in item 3.11.1.3.2:
    >>> "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
    >>> other-wise, the number of active service units would be higher than what
    >>> is allowed by the redundancy model. However, in an Nway redundancy
    >>> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
    >>> configuration attribute of the ser-vice unit is set to SA_FALSE), and a
    >>> repair of C2 should be attempted by reinstantiating it. If the attempt
    >>> to instantiate C2 fails, the service unit becomes disabled, and SI2 must
    >>> be switched-over; however, if the attempt to instantiate C2 is
    >>> successful, SI2 shall remain assigned to SU1, and based on other
    >>> configuration parameters and N-way redundancy model semantics, even
    >> SI1
    >>> might get reassigned to SU1."
    >>>
    >>> My comment on V2:
    >>>
    >>> The configuration in #2233 is different from the example in
    >>> specification, but it sounds to me the attempt to instantiate failed
    >>> component should be done as soon as possible.
    >>> The check in V2 patch means the failed component won't be instantiated
    >>> if its SU still has any assignment. It should be true to 2N and N+M, but
    >>> it's not for other SG. (As the example in specification, S2 does not
    >>> have any CSI assigned to failed component C2).
    >> [Praveen]As of now we have documented in the PR doc (conformance table
    >> section 3.11.1.3 Recovery) that if a component faults with comp-failover
    >> recovery then AMFD switch-overs the whole SU for N-Way, N-Way Active
    >> and
    >> N+M models also. This is just to highlight about other red models. But
    >> this documentation is not clear for an unassigned comp.
    >> But from the beginning, comp-failover is working this way only. At-least
    >> from clean up perspective we have fixed the problem of parallelism in
    >> the past in the ticket #474.
    >>
    >> One more thing I have noted, proxy-proxied implementation is based on
    >> B.01.01. As per B.01.01, proxy will register himself and its proxied as
    >> soon as it gets instantiated. In a configuration containing both proxy
    >> and proxied comp, if the proxy does not get any CSI and it faults with
    >> comp-failover recovery then in instantiation phase it may again register
    >> its proxy. I think proxy in other SU should register its proxied. I
    >> guess, from deployment perspective such a configuration in which a user
    >> configures proxy without any CSI may not exists and only possibility is
    >> an application modeling a legacy code in NoRed model. However, in the
    >> later version of spec B.01.02, proxied was supposed to mention the name
    >> of proxy CSI and thus proxy should register only when its get proxy CSI.
    >>
    >> One more point to be noted comp-failover can also be done as a part of
    >> escalation also. If a component is instantiated before the completion of
    >> comp-failover recovery and if faults again then it may escalate to
    >> node-failover before completion of comp-failover recovery.
    >>
    >> Since in spec there is no specific discussion for comp-failover recovery
    >> for an unassigned comp, I will encourage other maintainers also to
    >> provide inputs.
    >>
    >>
    >> Thanks,
    >> Praveen
    >>
    >>
    >>    Moreover, in the clc.cc,
    >>> amfnd does not check any of si_list.n_nodes, this probably is the logic
    >>> that has being done so far.
    >>>
    >>> Thanks,
    >>> Minh
    >>>
    >>> On 17/02/17 23:16, praveen malviya wrote:
    >>>> Hi Minh,
    >>>>
    >>>> I think we should see this problem from fault management perspective
    >>>> also. Here repair of failed component is performed before the
    >>>> completion of recovery.In the problem, component faulted with
    >>>> comp-failover recovery and it was successfully repaired(instantiated)
    >>>> when SU switch-over was still pending.
    >>>>
    >>>> Now the question is: Why it was never observed earlier? The reason is
    >>>> generally all components are assigned at least one CSI. In the present
    >>>> configuration failed component was not assigned any CSI. When this
    >>>> component was cleaned up and marked UNINSTANTIATED, AMFND sent
    >>>> comp-failover recovery request to AMFD. But after sending recovery
    >>>> request, it instantiated failed comp when SU has still assignments to
    >>>> be switch-overed. The code related to this assumes that comp will have
    >>>> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(),
    >>>> TERMINATING to UNINSTANTIATED if block). For normal sequence of
    >>>> comp-failover, su is repaired after removal of assignment in
    >>>> avnd_su_si_oper_done() by calling avnd_err_su_repair().
    >>>>
    >>>> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page
    >>>> 195) about switch-overing all the SIs of failed SU in case of
    >>>> comp-failed recovery and not for other models. In current OpenSAF
    >>>> implementation we are following this for all models.
    >>>>
    >>>> I think as a fix we should stop failed comp to get instantiated before
    >>>> removal of assignments. For this the check in clc.cc can be hardened
    >>>> to consider non-assigned comp failures.
    >>>> Attached is the patch (2233_v2.patch) based on this idea/approach.
    >>>>
    >>>> Thanks,
    >>>> Praveen
    >>>>
    >>>>
    >>>> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote:
    >>>>> Hi Praveen,
    >>>>>
    >>>>> Yes, you are right, I will update the description.
    >>>>>
    >>>>> Thanks, Minh
    >>>>>
    >>>>> Quoting praveen malviya <[email protected]>:
    >>>>>
    >>>>>> Hi Minh,
    >>>>>>
    >>>>>> One quick question:
    >>>>>> Ticket description says:
    >>>>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1
    >> depends
    >>>>>> safSi=AmfDemoTwon"
    >>>>>> But logs are related to without SIdep. Also in the configuration
    >>>>>> app3_twon3su3si.xml, SI dep classes are commented.
    >>>>>> I think ticket description needs correction as problem is without SI
    >>>>>> dep.
    >>>>>> Please confirm.
    >>>>>>
    >>>>>> Thanks,
    >>>>>> Praveen
    >>>>>>
    >>>>>>
    >>>>>> On 17-Feb-17 10:58 AM, praveen malviya wrote:
    >>>>>>> Hi Minh,
    >>>>>>>
    >>>>>>> I have started reviewing this patch.
    >>>>>>>
    >>>>>>> Thanks,
    >>>>>>> Praveen
    >>>>>>>
    >>>>>>> On 15-Feb-17 9:22 AM, minh chau wrote:
    >>>>>>>> Hi all,
    >>>>>>>>
    >>>>>>>> Have you had time to review this patch?
    >>>>>>>> It changes the component failover sequence, so I think we need
    >> more
    >>>>>>>> time
    >>>>>>>> to look at it.
    >>>>>>>>
    >>>>>>>> Thanks,
    >>>>>>>> Minh
    >>>>>>>>
    >>>>>>>> On 23/01/17 12:28, Minh Hon Chau wrote:
    >>>>>>>>>   src/amf/amfnd/avnd_su.h |   1 +
    >>>>>>>>>   src/amf/amfnd/clc.cc    |   3 ---
    >>>>>>>>>   src/amf/amfnd/di.cc     |  12 +++++++++++-
    >>>>>>>>>   src/amf/amfnd/susm.cc   |  32
    >> +++++++++++++++++++++++++++++---
    >>>>>>>>>   4 files changed, 41 insertions(+), 7 deletions(-)
    >>>>>>>>>
    >>>>>>>>>
    >>>>>>>>> In case component failover, faulty component will be terminated.
    >>>>>>>>> When
    >>>>>>>>> the reinstantiation
    >>>>>>>>> is done, amfnd will send su_oper_message (enabled) to amfd which
    >> is
    >>>>>>>>> running along with
    >>>>>>>>> component failover. In the reported problem, if su_oper_message
    >>>>>>>>> (enabled) comes to amfd
    >>>>>>>>> before the quiesced assignment response (as part of component
    >>>>>>>>> failover
    >>>>>>>>> sequence) comes to
    >>>>>>>>> amfd, then this quiesced assignment response is ignored, thus
    >>>>>>>>> component failover will not
    >>>>>>>>> finish.
    >>>>>>>>>
    >>>>>>>>> The problem is in function susi_success_sg_realign with act=5,
    >>>>>>>>> state=3, amfd always assumes
    >>>>>>>>> su having faulty component is OUT_OF_SERVICE. This assumption is
    >>>>>>>>> true
    >>>>>>>>> in most of the time
    >>>>>>>>> when su_oper_message (enabled) comes a little later than quiesced
    >>>>>>>>> assignment response. In fact
    >>>>>>>>> the su_oper_message (enabled) is not designed as part of
    >> component
    >>>>>>>>> failover sequence, thus it
    >>>>>>>>> can come any time during the failover. If amfd is getting a bit
    >>>>>>>>> busier
    >>>>>>>>> with RTA update then
    >>>>>>>>> the faulty component has enough to reinstiantiate so that amfnd
    >>>>>>>>> sends
    >>>>>>>>> su_oper_message (enabled)
    >>>>>>>>> before quiesced assignment response, the reported problem will be
    >>>>>>>>> seen.
    >>>>>>>>>
    >>>>>>>>> This patch hardens the component failover sequence by ensuring
    >> the
    >>>>>>>>> su_oper_message (enabled) to
    >>>>>>>>> be sent after su completes to remove assignment. This approach
    >> comes
    >>>>>>>>> from the similarity in
    >>>>>>>>> su failover, where the su_oper_message (enabled) is sent in repair
    >>>>>>>>> phase.
    >>>>>>>>>
    >>>>>>>>> diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h
    >>>>>>>>> --- a/src/amf/amfnd/avnd_su.h
    >>>>>>>>> +++ b/src/amf/amfnd/avnd_su.h
    >>>>>>>>> @@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis
    >>>>>>>>>   extern struct avnd_su_si_rec *avnd_silist_getprev(const struct
    >>>>>>>>> avnd_su_si_rec *);
    >>>>>>>>>   extern struct avnd_su_si_rec *avnd_silist_getlast(void);
    >>>>>>>>>   extern bool sufailover_in_progress(const AVND_SU *su);
    >>>>>>>>> +extern bool componentfailover_in_progress(const AVND_SU *su);
    >>>>>>>>>   extern bool sufailover_during_nodeswitchover(const AVND_SU
    >> *su);
    >>>>>>>>>   extern bool all_csis_in_removed_state(const AVND_SU *su);
    >>>>>>>>>   extern void su_reset_restart_count_in_comps(const struct
    >>>>>>>>> avnd_cb_tag
    >>>>>>>>> *cb, const AVND_SU *su);
    >>>>>>>>> diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
    >>>>>>>>> --- a/src/amf/amfnd/clc.cc
    >>>>>>>>> +++ b/src/amf/amfnd/clc.cc
    >>>>>>>>> @@ -2381,9 +2381,6 @@ uint32_t
    >> avnd_comp_clc_terming_cleansucc
    >>>>>>>>>               (m_AVND_SU_IS_FAILOVER(su))) {
    >>>>>>>>>           /* yes, request director to orchestrate component
    >>>>>>>>> failover */
    >>>>>>>>>           rc = avnd_di_oper_send(cb, su,
    >> SA_AMF_COMPONENT_FAILOVER);
    >>>>>>>>> -
    >>>>>>>>> -        //Reset component-failover here. SU failover is reset as
    >>>>>>>>> part
    >>>>>>>>> of REPAIRED admin op.
    >>>>>>>>> -        m_AVND_SU_FAILOVER_RESET(su);
    >>>>>>>>>       }
    >>>>>>>>>         /*
    >>>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
    >>>>>>>>> --- a/src/amf/amfnd/di.cc
    >>>>>>>>> +++ b/src/amf/amfnd/di.cc
    >>>>>>>>> @@ -894,7 +894,17 @@ uint32_t
    >> avnd_di_susi_resp_send(AVND_CB
    >>>>>>>>>               }
    >>>>>>>>>               m_AVND_SU_ALL_SI_RESET(su);
    >>>>>>>>>           }
    >>>>>>>>> -
    >>>>>>>>> +        if (componentfailover_in_progress(su)) {
    >>>>>>>>> +            if (all_csis_in_removed_state(su) == true) {
    >>>>>>>>> +                bool is_en;
    >>>>>>>>> +                m_AVND_SU_IS_ENABLED(su, is_en);
    >>>>>>>>> +                if (is_en) {
    >>>>>>>>> +                    if (avnd_di_oper_send(cb, su, 0) ==
    >>>>>>>>> NCSCC_RC_SUCCESS) {
    >>>>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
    >>>>>>>>> +                    }
    >>>>>>>>> +                }
    >>>>>>>>> +            }
    >>>>>>>>> +        }
    >>>>>>>>>       /* free the contents of avnd message */
    >>>>>>>>>       avnd_msg_content_free(cb, &msg);
    >>>>>>>>>   diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc
    >>>>>>>>> --- a/src/amf/amfnd/susm.cc
    >>>>>>>>> +++ b/src/amf/amfnd/susm.cc
    >>>>>>>>> @@ -1633,10 +1633,22 @@ uint32_t
    >> avnd_su_pres_st_chng_prc(AVND_C
    >>>>>>>>>               m_AVND_SU_IS_ENABLED(su, is_en);
    >>>>>>>>>               if (true == is_en) {
    >>>>>>>>>                   TRACE("SU oper state is enabled");
    >>>>>>>>> +                // do not send su_oper state if component
    >>>>>>>>> failover is
    >>>>>>>>> in progress
    >>>>>>>>>                   m_AVND_SU_OPER_STATE_SET(su,
    >>>>>>>>> SA_AMF_OPERATIONAL_ENABLED);
    >>>>>>>>> -                rc = avnd_di_oper_send(cb, su, 0);
    >>>>>>>>> -                if (NCSCC_RC_SUCCESS != rc)
    >>>>>>>>> -                    goto done;
    >>>>>>>>> +                if (componentfailover_in_progress(su) == true) {
    >>>>>>>>> +                    si = reinterpret_cast<AVND_SU_SI_REC*>
    >>>>>>>>> + (m_NCS_DBLIST_FIND_FIRST(&su->si_list));
    >>>>>>>>> +                    if (si == nullptr ||
    >>>>>>>>> all_csis_in_removed_state(su)) {
    >>>>>>>>> +                        rc = avnd_di_oper_send(cb, su, 0);
    >>>>>>>>> +                        if (rc != NCSCC_RC_SUCCESS)
    >>>>>>>>> +                            goto done;
    >>>>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
    >>>>>>>>> +                    }
    >>>>>>>>> +                } else {
    >>>>>>>>> +                    rc = avnd_di_oper_send(cb, su, 0);
    >>>>>>>>> +                    if (NCSCC_RC_SUCCESS != rc)
    >>>>>>>>> +                        goto done;
    >>>>>>>>> +                }
    >>>>>>>>>               }
    >>>>>>>>>               else
    >>>>>>>>>                   TRACE("SU oper state is disabled");
    >>>>>>>>> @@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const
    >> AVND_S
    >>>>>>>>>   }
    >>>>>>>>>     /**
    >>>>>>>>> + * This function checks if the componentfailover is going on.
    >>>>>>>>> + * @param su: ptr to the SU .
    >>>>>>>>> + *
    >>>>>>>>> + * @return true/false.
    >>>>>>>>> + */
    >>>>>>>>> +bool componentfailover_in_progress(const AVND_SU *su) {
    >>>>>>>>> +    if ((su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su))
    >> &&
    >>>>>>>>> +            (avnd_cb->oper_state !=
    >> SA_AMF_OPERATIONAL_DISABLED) &&
    >>>>>>>>> (!su->is_ncs) &&
    >>>>>>>>> +            m_AVND_SU_IS_FAILOVER(su))
    >>>>>>>>> +        return true;
    >>>>>>>>> +    return false;
    >>>>>>>>> +}
    >>>>>>>>> +
    >>>>>>>>> +/**
    >>>>>>>>>    * This function checks if the sufailover and node switchover 
are
    >>>>>>>>> going on.
    >>>>>>>>>    * @param su: ptr to the SU .
    >>>>>>>>>    *
    >>>>>>>>>
    >>>>>>> 
------------------------------------------------------------------------------
    >>>>>>>
    >>>>>>>
    >>>>>>> Check out the vibrant tech community on one of the world's most
    >>>>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
    >>>>>>> _______________________________________________
    >>>>>>> Opensaf-devel mailing list
    >>>>>>> [email protected]
    >>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
    >>>>>>>
    >>>>>
    
    



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

Reply via email to