Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

praveen malviya Mon, 20 Feb 2017 21:47:38 -0800

Hi Minh,

Please find my response inline with [Praveen].


Thanks,
Praveen

On 20-Feb-17 6:58 AM, minh chau wrote:
> Hi Praveen,
>
> Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233
> and #1902, it also can fix the problem.
> Here we have 2 solutions:
> - The one I sent for review is letting the failed component to be
> instantiated, I think it is current behavior. But one change is that
> amfnd will not report su operational message to amfd until amfnd
> finishes removing the assignment of (faulty) su which contains the
> failed component
> - The V2 patch postpones the instantiation of failed component. amfnd
> will instantiate the failed component (via avnd_err_su_repair) after
> amfnd finishes removing the assignment of faulty su.
>
> So basically the difference is the time that the failed component should
> be instantiated.
>
> Still in item 3.11.1.3.2:
> "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
> other-wise, the number of active service units would be higher than what
> is allowed by the redundancy model. However, in an Nway redundancy
> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
> configuration attribute of the ser-vice unit is set to SA_FALSE), and a
> repair of C2 should be attempted by reinstantiating it. If the attempt
> to instantiate C2 fails, the service unit becomes disabled, and SI2 must
> be switched-over; however, if the attempt to instantiate C2 is
> successful, SI2 shall remain assigned to SU1, and based on other
> configuration parameters and N-way redundancy model semantics, even SI1
> might get reassigned to SU1."
>
> My comment on V2:
>
> The configuration in #2233 is different from the example in
> specification, but it sounds to me the attempt to instantiate failed
> component should be done as soon as possible.
> The check in V2 patch means the failed component won't be instantiated
> if its SU still has any assignment. It should be true to 2N and N+M, but
> it's not for other SG. (As the example in specification, S2 does not
> have any CSI assigned to failed component C2).
[Praveen]As of now we have documented in the PR doc (conformance table 
section 3.11.1.3 Recovery) that if a component faults with comp-failover 
recovery then AMFD switch-overs the whole SU for N-Way, N-Way Active and 
N+M models also. This is just to highlight about other red models. But 
this documentation is not clear for an unassigned comp.
But from the beginning, comp-failover is working this way only. At-least 
from clean up perspective we have fixed the problem of parallelism in 
the past in the ticket #474.

One more thing I have noted, proxy-proxied implementation is based on 
B.01.01. As per B.01.01, proxy will register himself and its proxied as 
soon as it gets instantiated. In a configuration containing both proxy 
and proxied comp, if the proxy does not get any CSI and it faults with 
comp-failover recovery then in instantiation phase it may again register 
its proxy. I think proxy in other SU should register its proxied. I 
guess, from deployment perspective such a configuration in which a user 
configures proxy without any CSI may not exists and only possibility is 
an application modeling a legacy code in NoRed model. However, in the 
later version of spec B.01.02, proxied was supposed to mention the name 
of proxy CSI and thus proxy should register only when its get proxy CSI.

One more point to be noted comp-failover can also be done as a part of 
escalation also. If a component is instantiated before the completion of 
comp-failover recovery and if faults again then it may escalate to 
node-failover before completion of comp-failover recovery.

Since in spec there is no specific discussion for comp-failover recovery 
for an unassigned comp, I will encourage other maintainers also to 
provide inputs.


Thanks,
Praveen


  Moreover, in the clc.cc,
> amfnd does not check any of si_list.n_nodes, this probably is the logic
> that has being done so far.
>
> Thanks,
> Minh
>
> On 17/02/17 23:16, praveen malviya wrote:
>> Hi Minh,
>>
>> I think we should see this problem from fault management perspective
>> also. Here repair of failed component is performed before the
>> completion of recovery.In the problem, component faulted with
>> comp-failover recovery and it was successfully repaired(instantiated)
>> when SU switch-over was still pending.
>>
>> Now the question is: Why it was never observed earlier? The reason is
>> generally all components are assigned at least one CSI. In the present
>> configuration failed component was not assigned any CSI. When this
>> component was cleaned up and marked UNINSTANTIATED, AMFND sent
>> comp-failover recovery request to AMFD. But after sending recovery
>> request, it instantiated failed comp when SU has still assignments to
>> be switch-overed. The code related to this assumes that comp will have
>> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(),
>> TERMINATING to UNINSTANTIATED if block). For normal sequence of
>> comp-failover, su is repaired after removal of assignment in
>> avnd_su_si_oper_done() by calling avnd_err_su_repair().
>>
>> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page
>> 195) about switch-overing all the SIs of failed SU in case of
>> comp-failed recovery and not for other models. In current OpenSAF
>> implementation we are following this for all models.
>>
>> I think as a fix we should stop failed comp to get instantiated before
>> removal of assignments. For this the check in clc.cc can be hardened
>> to consider non-assigned comp failures.
>> Attached is the patch (2233_v2.patch) based on this idea/approach.
>>
>> Thanks,
>> Praveen
>>
>>
>> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote:
>>> Hi Praveen,
>>>
>>> Yes, you are right, I will update the description.
>>>
>>> Thanks, Minh
>>>
>>> Quoting praveen malviya <[email protected]>:
>>>
>>>> Hi Minh,
>>>>
>>>> One quick question:
>>>> Ticket description says:
>>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 depends
>>>> safSi=AmfDemoTwon"
>>>> But logs are related to without SIdep. Also in the configuration
>>>> app3_twon3su3si.xml, SI dep classes are commented.
>>>> I think ticket description needs correction as problem is without SI
>>>> dep.
>>>> Please confirm.
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>>
>>>> On 17-Feb-17 10:58 AM, praveen malviya wrote:
>>>>> Hi Minh,
>>>>>
>>>>> I have started reviewing this patch.
>>>>>
>>>>> Thanks,
>>>>> Praveen
>>>>>
>>>>> On 15-Feb-17 9:22 AM, minh chau wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Have you had time to review this patch?
>>>>>> It changes the component failover sequence, so I think we need more
>>>>>> time
>>>>>> to look at it.
>>>>>>
>>>>>> Thanks,
>>>>>> Minh
>>>>>>
>>>>>> On 23/01/17 12:28, Minh Hon Chau wrote:
>>>>>>>  src/amf/amfnd/avnd_su.h |   1 +
>>>>>>>  src/amf/amfnd/clc.cc    |   3 ---
>>>>>>>  src/amf/amfnd/di.cc     |  12 +++++++++++-
>>>>>>>  src/amf/amfnd/susm.cc   |  32 +++++++++++++++++++++++++++++---
>>>>>>>  4 files changed, 41 insertions(+), 7 deletions(-)
>>>>>>>
>>>>>>>
>>>>>>> In case component failover, faulty component will be terminated.
>>>>>>> When
>>>>>>> the reinstantiation
>>>>>>> is done, amfnd will send su_oper_message (enabled) to amfd which is
>>>>>>> running along with
>>>>>>> component failover. In the reported problem, if su_oper_message
>>>>>>> (enabled) comes to amfd
>>>>>>> before the quiesced assignment response (as part of component
>>>>>>> failover
>>>>>>> sequence) comes to
>>>>>>> amfd, then this quiesced assignment response is ignored, thus
>>>>>>> component failover will not
>>>>>>> finish.
>>>>>>>
>>>>>>> The problem is in function susi_success_sg_realign with act=5,
>>>>>>> state=3, amfd always assumes
>>>>>>> su having faulty component is OUT_OF_SERVICE. This assumption is
>>>>>>> true
>>>>>>> in most of the time
>>>>>>> when su_oper_message (enabled) comes a little later than quiesced
>>>>>>> assignment response. In fact
>>>>>>> the su_oper_message (enabled) is not designed as part of component
>>>>>>> failover sequence, thus it
>>>>>>> can come any time during the failover. If amfd is getting a bit
>>>>>>> busier
>>>>>>> with RTA update then
>>>>>>> the faulty component has enough to reinstiantiate so that amfnd
>>>>>>> sends
>>>>>>> su_oper_message (enabled)
>>>>>>> before quiesced assignment response, the reported problem will be
>>>>>>> seen.
>>>>>>>
>>>>>>> This patch hardens the component failover sequence by ensuring the
>>>>>>> su_oper_message (enabled) to
>>>>>>> be sent after su completes to remove assignment. This approach comes
>>>>>>> from the similarity in
>>>>>>> su failover, where the su_oper_message (enabled) is sent in repair
>>>>>>> phase.
>>>>>>>
>>>>>>> diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h
>>>>>>> --- a/src/amf/amfnd/avnd_su.h
>>>>>>> +++ b/src/amf/amfnd/avnd_su.h
>>>>>>> @@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis
>>>>>>>  extern struct avnd_su_si_rec *avnd_silist_getprev(const struct
>>>>>>> avnd_su_si_rec *);
>>>>>>>  extern struct avnd_su_si_rec *avnd_silist_getlast(void);
>>>>>>>  extern bool sufailover_in_progress(const AVND_SU *su);
>>>>>>> +extern bool componentfailover_in_progress(const AVND_SU *su);
>>>>>>>  extern bool sufailover_during_nodeswitchover(const AVND_SU *su);
>>>>>>>  extern bool all_csis_in_removed_state(const AVND_SU *su);
>>>>>>>  extern void su_reset_restart_count_in_comps(const struct
>>>>>>> avnd_cb_tag
>>>>>>> *cb, const AVND_SU *su);
>>>>>>> diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
>>>>>>> --- a/src/amf/amfnd/clc.cc
>>>>>>> +++ b/src/amf/amfnd/clc.cc
>>>>>>> @@ -2381,9 +2381,6 @@ uint32_t avnd_comp_clc_terming_cleansucc
>>>>>>>              (m_AVND_SU_IS_FAILOVER(su))) {
>>>>>>>          /* yes, request director to orchestrate component
>>>>>>> failover */
>>>>>>>          rc = avnd_di_oper_send(cb, su, SA_AMF_COMPONENT_FAILOVER);
>>>>>>> -
>>>>>>> -        //Reset component-failover here. SU failover is reset as
>>>>>>> part
>>>>>>> of REPAIRED admin op.
>>>>>>> -        m_AVND_SU_FAILOVER_RESET(su);
>>>>>>>      }
>>>>>>>        /*
>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>>>>> --- a/src/amf/amfnd/di.cc
>>>>>>> +++ b/src/amf/amfnd/di.cc
>>>>>>> @@ -894,7 +894,17 @@ uint32_t avnd_di_susi_resp_send(AVND_CB
>>>>>>>              }
>>>>>>>              m_AVND_SU_ALL_SI_RESET(su);
>>>>>>>          }
>>>>>>> -
>>>>>>> +        if (componentfailover_in_progress(su)) {
>>>>>>> +            if (all_csis_in_removed_state(su) == true) {
>>>>>>> +                bool is_en;
>>>>>>> +                m_AVND_SU_IS_ENABLED(su, is_en);
>>>>>>> +                if (is_en) {
>>>>>>> +                    if (avnd_di_oper_send(cb, su, 0) ==
>>>>>>> NCSCC_RC_SUCCESS) {
>>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
>>>>>>> +                    }
>>>>>>> +                }
>>>>>>> +            }
>>>>>>> +        }
>>>>>>>      /* free the contents of avnd message */
>>>>>>>      avnd_msg_content_free(cb, &msg);
>>>>>>>  diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc
>>>>>>> --- a/src/amf/amfnd/susm.cc
>>>>>>> +++ b/src/amf/amfnd/susm.cc
>>>>>>> @@ -1633,10 +1633,22 @@ uint32_t avnd_su_pres_st_chng_prc(AVND_C
>>>>>>>              m_AVND_SU_IS_ENABLED(su, is_en);
>>>>>>>              if (true == is_en) {
>>>>>>>                  TRACE("SU oper state is enabled");
>>>>>>> +                // do not send su_oper state if component
>>>>>>> failover is
>>>>>>> in progress
>>>>>>>                  m_AVND_SU_OPER_STATE_SET(su,
>>>>>>> SA_AMF_OPERATIONAL_ENABLED);
>>>>>>> -                rc = avnd_di_oper_send(cb, su, 0);
>>>>>>> -                if (NCSCC_RC_SUCCESS != rc)
>>>>>>> -                    goto done;
>>>>>>> +                if (componentfailover_in_progress(su) == true) {
>>>>>>> +                    si = reinterpret_cast<AVND_SU_SI_REC*>
>>>>>>> + (m_NCS_DBLIST_FIND_FIRST(&su->si_list));
>>>>>>> +                    if (si == nullptr ||
>>>>>>> all_csis_in_removed_state(su)) {
>>>>>>> +                        rc = avnd_di_oper_send(cb, su, 0);
>>>>>>> +                        if (rc != NCSCC_RC_SUCCESS)
>>>>>>> +                            goto done;
>>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
>>>>>>> +                    }
>>>>>>> +                } else {
>>>>>>> +                    rc = avnd_di_oper_send(cb, su, 0);
>>>>>>> +                    if (NCSCC_RC_SUCCESS != rc)
>>>>>>> +                        goto done;
>>>>>>> +                }
>>>>>>>              }
>>>>>>>              else
>>>>>>>                  TRACE("SU oper state is disabled");
>>>>>>> @@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const AVND_S
>>>>>>>  }
>>>>>>>    /**
>>>>>>> + * This function checks if the componentfailover is going on.
>>>>>>> + * @param su: ptr to the SU .
>>>>>>> + *
>>>>>>> + * @return true/false.
>>>>>>> + */
>>>>>>> +bool componentfailover_in_progress(const AVND_SU *su) {
>>>>>>> +    if ((su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su)) &&
>>>>>>> +            (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) &&
>>>>>>> (!su->is_ncs) &&
>>>>>>> +            m_AVND_SU_IS_FAILOVER(su))
>>>>>>> +        return true;
>>>>>>> +    return false;
>>>>>>> +}
>>>>>>> +
>>>>>>> +/**
>>>>>>>   * This function checks if the sufailover and node switchover are
>>>>>>> going on.
>>>>>>>   * @param su: ptr to the SU .
>>>>>>>   *
>>>>>>>
>>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> Check out the vibrant tech community on one of the world's most
>>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>>>>> _______________________________________________
>>>>> Opensaf-devel mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>>>
>>>
>>>
>

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

Reply via email to