Hi Nagu, Praveen,

Please find my comment in [Minh3]

Thanks,
Minh

On 22/02/17 19:34, Nagendra Kumar wrote:
>>> Since in spec there is no specific discussion for comp-failover recovery 
>>> for an unassigned comp, I will encourage other maintainers also to provide 
>>> inputs.
> I do agree for not instantiating failed component before recovery, this keeps 
> the approach similar to SU failover also.
[Minh3]: There's one example of component failover that I would like us 
to have a look
- 2N application, SU4/SU5 has active/standby assignment respectively, 
each SU has 3 components
- Add a sleep of 10 seconds in clc script start command of first 
component C41 of SU4
Steps:
1- Kill C41 to trigger component failover
2- SU4 goes for quiesced assignment
3- SU5 goes for active assignment
4- SU4 is removed its assignment
5- Now there's a pause of 10 seconds due to clc script start, to ensure 
that C41 is healthy
6- Next SU4 has standby assignment.

 From the above example, I think we can see some problems if the 
re-instantiation of C41 is delayed:
- Because C41 is faulty, it needs to be restarted ok because its SU has 
assignment
- Moving re-instantiation of C41 is further down that means the recovery 
will take longer
- What if re-instantiation of C41 leads to instantation-failed

Whether or not the C41 has assignment or is unassigned, the 
OperState/PresenceState result from re-instantiation of faulty C41 
affects to SU4's eligibility for assignment.
There's a parallelism between [restart of faulty component C41] and 
[movement from Active->Quiesced->Removed assignment of SU4], it's good 
to have and it's current behavior of amfnd.
It's my understanding so far but I could be wrong. Let's check with Hans 
and Gary.

>
> @Minh: If you don't mind, we can take su oper state changes in an 
> enhancement. What do you say ?
[Minh3]: Enhancement is ok, but I hope we can have this fix in 5.2 release.
>
> Thanks
> -Nagu
>
>> -----Original Message-----
>> From: praveen malviya
>> Sent: 21 February 2017 11:15
>> To: minh chau
>> Cc: [email protected]; Nagendra Kumar;
>> [email protected]; [email protected]; opensaf-
>> [email protected]
>> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message
>> synchronizes with component failover sequence [#2233]
>>
>> Hi Minh,
>>
>> Please find my response inline with [Praveen].
>>
>> Thanks,
>> Praveen
>>
>> On 20-Feb-17 6:58 AM, minh chau wrote:
>>> Hi Praveen,
>>>
>>> Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233
>>> and #1902, it also can fix the problem.
>>> Here we have 2 solutions:
>>> - The one I sent for review is letting the failed component to be
>>> instantiated, I think it is current behavior. But one change is that
>>> amfnd will not report su operational message to amfd until amfnd
>>> finishes removing the assignment of (faulty) su which contains the
>>> failed component
>>> - The V2 patch postpones the instantiation of failed component. amfnd
>>> will instantiate the failed component (via avnd_err_su_repair) after
>>> amfnd finishes removing the assignment of faulty su.
>>>
>>> So basically the difference is the time that the failed component should
>>> be instantiated.
>>>
>>> Still in item 3.11.1.3.2:
>>> "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
>>> other-wise, the number of active service units would be higher than what
>>> is allowed by the redundancy model. However, in an Nway redundancy
>>> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
>>> configuration attribute of the ser-vice unit is set to SA_FALSE), and a
>>> repair of C2 should be attempted by reinstantiating it. If the attempt
>>> to instantiate C2 fails, the service unit becomes disabled, and SI2 must
>>> be switched-over; however, if the attempt to instantiate C2 is
>>> successful, SI2 shall remain assigned to SU1, and based on other
>>> configuration parameters and N-way redundancy model semantics, even
>> SI1
>>> might get reassigned to SU1."
>>>
>>> My comment on V2:
>>>
>>> The configuration in #2233 is different from the example in
>>> specification, but it sounds to me the attempt to instantiate failed
>>> component should be done as soon as possible.
>>> The check in V2 patch means the failed component won't be instantiated
>>> if its SU still has any assignment. It should be true to 2N and N+M, but
>>> it's not for other SG. (As the example in specification, S2 does not
>>> have any CSI assigned to failed component C2).
>> [Praveen]As of now we have documented in the PR doc (conformance table
>> section 3.11.1.3 Recovery) that if a component faults with comp-failover
>> recovery then AMFD switch-overs the whole SU for N-Way, N-Way Active
>> and
>> N+M models also. This is just to highlight about other red models. But
>> this documentation is not clear for an unassigned comp.
>> But from the beginning, comp-failover is working this way only. At-least
>> from clean up perspective we have fixed the problem of parallelism in
>> the past in the ticket #474.
>>
>> One more thing I have noted, proxy-proxied implementation is based on
>> B.01.01. As per B.01.01, proxy will register himself and its proxied as
>> soon as it gets instantiated. In a configuration containing both proxy
>> and proxied comp, if the proxy does not get any CSI and it faults with
>> comp-failover recovery then in instantiation phase it may again register
>> its proxy. I think proxy in other SU should register its proxied. I
>> guess, from deployment perspective such a configuration in which a user
>> configures proxy without any CSI may not exists and only possibility is
>> an application modeling a legacy code in NoRed model. However, in the
>> later version of spec B.01.02, proxied was supposed to mention the name
>> of proxy CSI and thus proxy should register only when its get proxy CSI.
>>
>> One more point to be noted comp-failover can also be done as a part of
>> escalation also. If a component is instantiated before the completion of
>> comp-failover recovery and if faults again then it may escalate to
>> node-failover before completion of comp-failover recovery.
>>
>> Since in spec there is no specific discussion for comp-failover recovery
>> for an unassigned comp, I will encourage other maintainers also to
>> provide inputs.
>>
>>
>> Thanks,
>> Praveen
>>
>>
>>    Moreover, in the clc.cc,
>>> amfnd does not check any of si_list.n_nodes, this probably is the logic
>>> that has being done so far.
>>>
>>> Thanks,
>>> Minh
>>>
>>> On 17/02/17 23:16, praveen malviya wrote:
>>>> Hi Minh,
>>>>
>>>> I think we should see this problem from fault management perspective
>>>> also. Here repair of failed component is performed before the
>>>> completion of recovery.In the problem, component faulted with
>>>> comp-failover recovery and it was successfully repaired(instantiated)
>>>> when SU switch-over was still pending.
>>>>
>>>> Now the question is: Why it was never observed earlier? The reason is
>>>> generally all components are assigned at least one CSI. In the present
>>>> configuration failed component was not assigned any CSI. When this
>>>> component was cleaned up and marked UNINSTANTIATED, AMFND sent
>>>> comp-failover recovery request to AMFD. But after sending recovery
>>>> request, it instantiated failed comp when SU has still assignments to
>>>> be switch-overed. The code related to this assumes that comp will have
>>>> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(),
>>>> TERMINATING to UNINSTANTIATED if block). For normal sequence of
>>>> comp-failover, su is repaired after removal of assignment in
>>>> avnd_su_si_oper_done() by calling avnd_err_su_repair().
>>>>
>>>> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page
>>>> 195) about switch-overing all the SIs of failed SU in case of
>>>> comp-failed recovery and not for other models. In current OpenSAF
>>>> implementation we are following this for all models.
>>>>
>>>> I think as a fix we should stop failed comp to get instantiated before
>>>> removal of assignments. For this the check in clc.cc can be hardened
>>>> to consider non-assigned comp failures.
>>>> Attached is the patch (2233_v2.patch) based on this idea/approach.
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>>
>>>> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote:
>>>>> Hi Praveen,
>>>>>
>>>>> Yes, you are right, I will update the description.
>>>>>
>>>>> Thanks, Minh
>>>>>
>>>>> Quoting praveen malviya <[email protected]>:
>>>>>
>>>>>> Hi Minh,
>>>>>>
>>>>>> One quick question:
>>>>>> Ticket description says:
>>>>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1
>> depends
>>>>>> safSi=AmfDemoTwon"
>>>>>> But logs are related to without SIdep. Also in the configuration
>>>>>> app3_twon3su3si.xml, SI dep classes are commented.
>>>>>> I think ticket description needs correction as problem is without SI
>>>>>> dep.
>>>>>> Please confirm.
>>>>>>
>>>>>> Thanks,
>>>>>> Praveen
>>>>>>
>>>>>>
>>>>>> On 17-Feb-17 10:58 AM, praveen malviya wrote:
>>>>>>> Hi Minh,
>>>>>>>
>>>>>>> I have started reviewing this patch.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Praveen
>>>>>>>
>>>>>>> On 15-Feb-17 9:22 AM, minh chau wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Have you had time to review this patch?
>>>>>>>> It changes the component failover sequence, so I think we need
>> more
>>>>>>>> time
>>>>>>>> to look at it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Minh
>>>>>>>>
>>>>>>>> On 23/01/17 12:28, Minh Hon Chau wrote:
>>>>>>>>>   src/amf/amfnd/avnd_su.h |   1 +
>>>>>>>>>   src/amf/amfnd/clc.cc    |   3 ---
>>>>>>>>>   src/amf/amfnd/di.cc     |  12 +++++++++++-
>>>>>>>>>   src/amf/amfnd/susm.cc   |  32
>> +++++++++++++++++++++++++++++---
>>>>>>>>>   4 files changed, 41 insertions(+), 7 deletions(-)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In case component failover, faulty component will be terminated.
>>>>>>>>> When
>>>>>>>>> the reinstantiation
>>>>>>>>> is done, amfnd will send su_oper_message (enabled) to amfd which
>> is
>>>>>>>>> running along with
>>>>>>>>> component failover. In the reported problem, if su_oper_message
>>>>>>>>> (enabled) comes to amfd
>>>>>>>>> before the quiesced assignment response (as part of component
>>>>>>>>> failover
>>>>>>>>> sequence) comes to
>>>>>>>>> amfd, then this quiesced assignment response is ignored, thus
>>>>>>>>> component failover will not
>>>>>>>>> finish.
>>>>>>>>>
>>>>>>>>> The problem is in function susi_success_sg_realign with act=5,
>>>>>>>>> state=3, amfd always assumes
>>>>>>>>> su having faulty component is OUT_OF_SERVICE. This assumption is
>>>>>>>>> true
>>>>>>>>> in most of the time
>>>>>>>>> when su_oper_message (enabled) comes a little later than quiesced
>>>>>>>>> assignment response. In fact
>>>>>>>>> the su_oper_message (enabled) is not designed as part of
>> component
>>>>>>>>> failover sequence, thus it
>>>>>>>>> can come any time during the failover. If amfd is getting a bit
>>>>>>>>> busier
>>>>>>>>> with RTA update then
>>>>>>>>> the faulty component has enough to reinstiantiate so that amfnd
>>>>>>>>> sends
>>>>>>>>> su_oper_message (enabled)
>>>>>>>>> before quiesced assignment response, the reported problem will be
>>>>>>>>> seen.
>>>>>>>>>
>>>>>>>>> This patch hardens the component failover sequence by ensuring
>> the
>>>>>>>>> su_oper_message (enabled) to
>>>>>>>>> be sent after su completes to remove assignment. This approach
>> comes
>>>>>>>>> from the similarity in
>>>>>>>>> su failover, where the su_oper_message (enabled) is sent in repair
>>>>>>>>> phase.
>>>>>>>>>
>>>>>>>>> diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h
>>>>>>>>> --- a/src/amf/amfnd/avnd_su.h
>>>>>>>>> +++ b/src/amf/amfnd/avnd_su.h
>>>>>>>>> @@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis
>>>>>>>>>   extern struct avnd_su_si_rec *avnd_silist_getprev(const struct
>>>>>>>>> avnd_su_si_rec *);
>>>>>>>>>   extern struct avnd_su_si_rec *avnd_silist_getlast(void);
>>>>>>>>>   extern bool sufailover_in_progress(const AVND_SU *su);
>>>>>>>>> +extern bool componentfailover_in_progress(const AVND_SU *su);
>>>>>>>>>   extern bool sufailover_during_nodeswitchover(const AVND_SU
>> *su);
>>>>>>>>>   extern bool all_csis_in_removed_state(const AVND_SU *su);
>>>>>>>>>   extern void su_reset_restart_count_in_comps(const struct
>>>>>>>>> avnd_cb_tag
>>>>>>>>> *cb, const AVND_SU *su);
>>>>>>>>> diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
>>>>>>>>> --- a/src/amf/amfnd/clc.cc
>>>>>>>>> +++ b/src/amf/amfnd/clc.cc
>>>>>>>>> @@ -2381,9 +2381,6 @@ uint32_t
>> avnd_comp_clc_terming_cleansucc
>>>>>>>>>               (m_AVND_SU_IS_FAILOVER(su))) {
>>>>>>>>>           /* yes, request director to orchestrate component
>>>>>>>>> failover */
>>>>>>>>>           rc = avnd_di_oper_send(cb, su,
>> SA_AMF_COMPONENT_FAILOVER);
>>>>>>>>> -
>>>>>>>>> -        //Reset component-failover here. SU failover is reset as
>>>>>>>>> part
>>>>>>>>> of REPAIRED admin op.
>>>>>>>>> -        m_AVND_SU_FAILOVER_RESET(su);
>>>>>>>>>       }
>>>>>>>>>         /*
>>>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>>>>>>> --- a/src/amf/amfnd/di.cc
>>>>>>>>> +++ b/src/amf/amfnd/di.cc
>>>>>>>>> @@ -894,7 +894,17 @@ uint32_t
>> avnd_di_susi_resp_send(AVND_CB
>>>>>>>>>               }
>>>>>>>>>               m_AVND_SU_ALL_SI_RESET(su);
>>>>>>>>>           }
>>>>>>>>> -
>>>>>>>>> +        if (componentfailover_in_progress(su)) {
>>>>>>>>> +            if (all_csis_in_removed_state(su) == true) {
>>>>>>>>> +                bool is_en;
>>>>>>>>> +                m_AVND_SU_IS_ENABLED(su, is_en);
>>>>>>>>> +                if (is_en) {
>>>>>>>>> +                    if (avnd_di_oper_send(cb, su, 0) ==
>>>>>>>>> NCSCC_RC_SUCCESS) {
>>>>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
>>>>>>>>> +                    }
>>>>>>>>> +                }
>>>>>>>>> +            }
>>>>>>>>> +        }
>>>>>>>>>       /* free the contents of avnd message */
>>>>>>>>>       avnd_msg_content_free(cb, &msg);
>>>>>>>>>   diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc
>>>>>>>>> --- a/src/amf/amfnd/susm.cc
>>>>>>>>> +++ b/src/amf/amfnd/susm.cc
>>>>>>>>> @@ -1633,10 +1633,22 @@ uint32_t
>> avnd_su_pres_st_chng_prc(AVND_C
>>>>>>>>>               m_AVND_SU_IS_ENABLED(su, is_en);
>>>>>>>>>               if (true == is_en) {
>>>>>>>>>                   TRACE("SU oper state is enabled");
>>>>>>>>> +                // do not send su_oper state if component
>>>>>>>>> failover is
>>>>>>>>> in progress
>>>>>>>>>                   m_AVND_SU_OPER_STATE_SET(su,
>>>>>>>>> SA_AMF_OPERATIONAL_ENABLED);
>>>>>>>>> -                rc = avnd_di_oper_send(cb, su, 0);
>>>>>>>>> -                if (NCSCC_RC_SUCCESS != rc)
>>>>>>>>> -                    goto done;
>>>>>>>>> +                if (componentfailover_in_progress(su) == true) {
>>>>>>>>> +                    si = reinterpret_cast<AVND_SU_SI_REC*>
>>>>>>>>> + (m_NCS_DBLIST_FIND_FIRST(&su->si_list));
>>>>>>>>> +                    if (si == nullptr ||
>>>>>>>>> all_csis_in_removed_state(su)) {
>>>>>>>>> +                        rc = avnd_di_oper_send(cb, su, 0);
>>>>>>>>> +                        if (rc != NCSCC_RC_SUCCESS)
>>>>>>>>> +                            goto done;
>>>>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
>>>>>>>>> +                    }
>>>>>>>>> +                } else {
>>>>>>>>> +                    rc = avnd_di_oper_send(cb, su, 0);
>>>>>>>>> +                    if (NCSCC_RC_SUCCESS != rc)
>>>>>>>>> +                        goto done;
>>>>>>>>> +                }
>>>>>>>>>               }
>>>>>>>>>               else
>>>>>>>>>                   TRACE("SU oper state is disabled");
>>>>>>>>> @@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const
>> AVND_S
>>>>>>>>>   }
>>>>>>>>>     /**
>>>>>>>>> + * This function checks if the componentfailover is going on.
>>>>>>>>> + * @param su: ptr to the SU .
>>>>>>>>> + *
>>>>>>>>> + * @return true/false.
>>>>>>>>> + */
>>>>>>>>> +bool componentfailover_in_progress(const AVND_SU *su) {
>>>>>>>>> +    if ((su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su))
>> &&
>>>>>>>>> +            (avnd_cb->oper_state !=
>> SA_AMF_OPERATIONAL_DISABLED) &&
>>>>>>>>> (!su->is_ncs) &&
>>>>>>>>> +            m_AVND_SU_IS_FAILOVER(su))
>>>>>>>>> +        return true;
>>>>>>>>> +    return false;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>>    * This function checks if the sufailover and node switchover are
>>>>>>>>> going on.
>>>>>>>>>    * @param su: ptr to the SU .
>>>>>>>>>    *
>>>>>>>>>
>>>>>>> ------------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> Check out the vibrant tech community on one of the world's most
>>>>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>>>>>>> _______________________________________________
>>>>>>> Opensaf-devel mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>>>>>
>>>>>


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to