Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

praveen malviya Sun, 05 Mar 2017 22:52:51 -0800

Hi Minh,

Please see inline with [Praveen].


Thanks,
Praveen

On 03-Mar-17 5:39 PM, minh chau wrote:
> Hi Praveen,
>
> I have two comments with [Minh4].
>
> Thanks
> Minh
>
> On 02/03/17 20:49, praveen malviya wrote:
>> Hi Minh,
>> Please see response with [Praveen].
>>
>> Thanks,
>> Praveen
>>
>>
>>
>> On 02-Mar-17 1:43 PM, minh chau wrote:
>>> Hi,
>>>
>>> Thanks Gary.
>>> @Nagu, Praveen: Have you had time to check the example in my previous
>>> email?
>>> The ticket #2179 is about to document that full escalation is supported
>>> for SC absence feature, it is waiting for fix of #2233.
>>> I think there's not big change in code for #2233, it's a matter of
>>> decision to make for re-instantiation of failed component.
>>>
>>> Thanks,
>>> Minh
>>>
>>> On 01/03/17 15:42, Gary Lee wrote:
>>>> Hi
>>>>
>>>> It seems the component should be re-instantiated if it has no CSI.
>>>> Whether or not there is an SI assigned should be irrelevant?
>>>>
>>>> Thanks
>>>> Gary
>>>>
>>>> -----Original Message-----
>>>> From: minh chau <minh.c...@dektech.com.au>
>>>> Date: Thursday, 23 February 2017 at 3:16 pm
>>>> To: Nagendra Kumar <nagendr...@oracle.com>, Praveen Malviya
>>>> <praveen.malv...@oracle.com>
>>>> Cc: <hans.nordeb...@ericsson.com>, gary <gary....@dektech.com.au>,
>>>> <long.hb.ngu...@dektech.com.au>, <opensaf-devel@lists.sourceforge.net>
>>>> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational
>>>> message synchronizes with component failover sequence [#2233]
>>>>
>>>>      Hi Nagu, Praveen,
>>>>           Please find my comment in [Minh3]
>>>>           Thanks,
>>>>      Minh
>>>>           On 22/02/17 19:34, Nagendra Kumar wrote:
>>>>      >>> Since in spec there is no specific discussion for
>>>> comp-failover recovery for an unassigned comp, I will encourage other
>>>> maintainers also to provide inputs.
>>>>      > I do agree for not instantiating failed component before
>>>> recovery, this keeps the approach similar to SU failover also.
>>>>      [Minh3]: There's one example of component failover that I would
>>>> like us
>>>>      to have a look
>>>>      - 2N application, SU4/SU5 has active/standby assignment
>>>> respectively,
>>>>      each SU has 3 components
>>>>      - Add a sleep of 10 seconds in clc script start command of first
>>>>      component C41 of SU4
>>>>      Steps:
>>>>      1- Kill C41 to trigger component failover
>>>>      2- SU4 goes for quiesced assignment
>>>>      3- SU5 goes for active assignment
>>>>      4- SU4 is removed its assignment
>>>>      5- Now there's a pause of 10 seconds due to clc script start, to
>>>> ensure
>>>>      that C41 is healthy
>>>>      6- Next SU4 has standby assignment.
>>>>            From the above example, I think we can see some problems if
>>>> the
>>>>      re-instantiation of C41 is delayed:
>>>>      - Because C41 is faulty, it needs to be restarted ok because its
>>>> SU has
>>>>      assignment
>>>>      - Moving re-instantiation of C41 is further down that means the
>>>> recovery
>>>>      will take longer
>>>>      - What if re-instantiation of C41 leads to instantation-failed
>> [Praveen] If AMFND re-instantiates C41 after removal of assignments and
>> it moves to instantiation-failed, then:
>> - The node will be rebooted if nodefailfastoninstfailure=true.
>> - If nodefailfastoninstfailure=false then, as per section 4.6 page
>> 212, the SU will be marked INST_FAILED and AMF will have to terminate all
>> the components. Termination of other components will be easier if they
>> do not have assignments or pending assignments.
>>
>> If C41 is instantiated before removal of assignments and it moves to
>> INST_FAILED state, then AMFND will be terminating other comps of the SU
>> when they are in the middle of quiesced or removal of assignment. So a
>> component will have different orders of quiesced/removal/terminate
>> callbacks in its mailbox. This will make things complex.
> [Minh4]: I am not sure if I understand the complex thing you mentioned
> as it has been working like this for long time. If we are going to
> change the current behavior to the way that amfnd will instantiate
> failed component after removal assignment, then I think it should be
> addressed in another enhancement ticket. The complex thing in current
> behavior could be improved/removed if we change to another behavior. It
> looks like a big change not just in the code, also backward compatible
> consideration. At this moment, let's fix the message ordering problem of
> existing code/design (you already agreed?). I can create another
> enhancement/discussion ticket for matter of instantiation of failed
> component, from there more evidence of specs will be added, ... What do
> you think?
[Praveen] I have agreed to the proper sequencing of messages. To repeat,
my concern was about instantiating the component before the removal
of assignments in the context of comp-failover recovery.

Both the inst-failed cases that I mentioned (nodefailfastoninstfailure
enabled or disabled) describe the current implementation. If a
component enters INST_FAILED state and nodefailfastoninstfailure is
false, then AMF currently terminates all other components of the SU. So
if we instantiate the component before removal of assignments and the
instantiation fails, termination of the other healthy components runs
in parallel with quiesced/removal of assignments (see the osafamfnd_v1
traces with your version of the patch). But if the component is
instantiated after removal of assignments and it goes to INST_FAILED
state, then there is no quiesced or removal sequence left to run (see
the osafamfnd_v2 traces with late instantiation of the comp).
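
To make the difference concrete, here is a rough sketch of the two
orderings. It is only a model with hypothetical names
(pending_at_terminate, kFailoverSteps), not amfnd code, and it assumes
the failover of the faulty SU reduces to a quiesced step followed by a
removal step:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model, not amfnd code. The failover of the faulty SU is
// reduced to two assignment steps that each healthy component must handle.
static const std::vector<std::string> kFailoverSteps = {"quiesced", "remove"};

// Returns the assignment steps still outstanding for a healthy component at
// the moment AMFND starts terminating it because the failed component went
// to INST_FAILED.
//   instantiate_after_removal: true models late instantiation (v2),
//                              false models early instantiation (v1).
//   steps_done: failover steps already completed when instantiation fails;
//               only meaningful for the early case.
std::vector<std::string> pending_at_terminate(bool instantiate_after_removal,
                                              std::size_t steps_done) {
  assert(steps_done <= kFailoverSteps.size());
  if (instantiate_after_removal) {
    // Instantiation only starts once both steps are finished, so nothing
    // can be pending when termination begins.
    return {};
  }
  // Early instantiation: whatever has not completed yet races with terminate.
  return std::vector<std::string>(
      kFailoverSteps.begin() + static_cast<std::ptrdiff_t>(steps_done),
      kFailoverSteps.end());
}
```

With early instantiation the result depends on timing (it can be both
steps, only the removal, or nothing), while with late instantiation it
is always empty. That timing dependence is the complexity I was
referring to.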

Also, in the current implementation, an assigned component that fails
with comp-failover recovery is always instantiated after removal of
assignments.
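
As a sketch of the decision under discussion (hypothetical names
RepairPolicy, FailedCompContext and may_instantiate_now; this is not
the check in clc.cc, only my reading of the two behaviours described in
this thread):

```cpp
#include <cassert>

// Hypothetical names; a sketch of the decision, not the clc.cc logic.
enum class RepairPolicy {
  kInstantiateEarly,         // behaviour kept by the patch under review
  kInstantiateAfterRemoval   // behaviour proposed in 2233_v2.patch
};

struct FailedCompContext {
  bool comp_has_csi;        // failed component had at least one CSI
  bool su_has_assignments;  // its SU still has SIs to be switched over
};

// True if the failed component may be re-instantiated right away.
bool may_instantiate_now(RepairPolicy policy, const FailedCompContext& ctx) {
  switch (policy) {
    case RepairPolicy::kInstantiateEarly:
      // An assigned component waits for removal; an unassigned one does not.
      return !ctx.comp_has_csi || !ctx.su_has_assignments;
    case RepairPolicy::kInstantiateAfterRemoval:
      // Every failed component waits until the SU has no assignments left.
      return !ctx.su_has_assignments;
  }
  return false;
}
```

The two policies only differ for an unassigned component whose SU still
has assignments, which is exactly the configuration in #2233.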

>>
>>>           Whether or not the C41 has assignment or is unassigned, the
>>>>      OperState/PresenceState result from re-instantiation of faulty C41
>>>>      affects to SU4's eligibility for assignment.
>> [Praveen] Here SU4 will get only fresh assignments after C41 gets
>> enabled. For fresh assignments, AMF can choose any of the spare SUs
>> available, and SU4 will be chosen based on ranks.
>>
>> At the same time, AMF spec encourages not to choose faulty SUs soon
>> for assignments. It is highlighted in SG Auto adjust feature context
>> in section 3.6.1.2 Initiation of the Auto-Adjust Procedure for a
>> Service Group:
>> "
>> However, if the completion of a recovery/repair operation
>> has made the service group eligible for auto-adjustment (for example,
>> if a node joins the cluster after the repair), it is not so wise to
>> run the auto-adjust procedure for the service group involving the
>> newly repaired service units immediately. Thus, the service
>> group-level configuration attribute auto-adjust probation period has
>> been introduced (actually, the saAmfSGAutoAdjustProb configuration
>> attribute in the SaAmfSG object class, shown in Section 8.9). When a
>> service unit becomes available for auto-adjustment after a
>> repair/recovery operation, the service unit enters its autoadjust
>> probation period, and it cannot thereby be used for auto-adjustment
>> during this probation period.
>> "
> [Minh4] It seems you are pointing a contradiction from the specs:
> 3.6.1.2, 3.11.1.3.2. I am not sure but it sounds like the
> recovery/repaired in auto-adjustment is for failed su, and the repaired
> su needs a probation period before participating in auto-adjust feature.
> And I don't see a connection to reinstantiation of failed component in
> component failover. Maybe we can elaborate it in another ticket.
[Praveen] I do not see any contradiction here. If auto-adjust is
implemented and enabled, it will work like this: a component that fails
with comp-failover recovery will disable the SU. AMF will perform
failover of the component as part of recovery. After successful
recovery, AMF will perform the repair and instantiate the failed comp.
This will enable the SU, which now enters its auto-adjust probation
period, and AMF will run a probation timer for it. While the auto-adjust
probation timer is running for this SU, it cannot be given any
assignments.
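
A minimal sketch of that probation check, assuming auto-adjust were
implemented. The struct and function names are hypothetical, and
`probation` stands in for the SG's saAmfSGAutoAdjustProb value:

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical per-SU record; field names are illustrative only.
struct SuProbation {
  bool oper_enabled;              // SU was re-enabled by the repair
  Clock::time_point repaired_at;  // when the repair completed
};

// True once the SU may take part in auto-adjustment again, i.e. it is
// enabled and its probation period has fully elapsed.
bool eligible_for_auto_adjust(const SuProbation& su,
                              Clock::duration probation,
                              Clock::time_point now) {
  return su.oper_enabled && (now - su.repaired_at) >= probation;
}
```

So a freshly repaired SU stays out of auto-adjustment until the timer
expires, regardless of its rank.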

>>
>>>>      There's a parallelism between [restart of faulty component C41]
>>>> and
>>>>      [movement from Active->Quiesced->Removed assignment of SU4], it's
>>>> good
>>>>      to have and it's current behavior of amfnd.
>>>>      It's my understanding so far but I could be wrong. Let's check
>>>> with Hans
>>>>      and Gary.
>>>>           >
>>>>      > @Minh: If you don't mind, we can take su oper state changes in
>>>> an enhancement. What do you say ?
>>>>      [Minh3]: Enhancement is ok, but I hope we can have this fix in
>>>> 5.2 release.
>>>>      >
>>>>      > Thanks
>>>>      > -Nagu
>>>>      >
>>>>      >> -----Original Message-----
>>>>      >> From: praveen malviya
>>>>      >> Sent: 21 February 2017 11:15
>>>>      >> To: minh chau
>>>>      >> Cc: hans.nordeb...@ericsson.com; Nagendra Kumar;
>>>>      >> gary....@dektech.com.au; long.hb.ngu...@dektech.com.au;
>>>> opensaf-
>>>>      >> de...@lists.sourceforge.net
>>>>      >> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su
>>>> operational message
>>>>      >> synchronizes with component failover sequence [#2233]
>>>>      >>
>>>>      >> Hi Minh,
>>>>      >>
>>>>      >> Please find my response inline with [Praveen].
>>>>      >>
>>>>      >> Thanks,
>>>>      >> Praveen
>>>>      >>
>>>>      >> On 20-Feb-17 6:58 AM, minh chau wrote:
>>>>      >>> Hi Praveen,
>>>>      >>>
>>>>      >>> Thanks for your V2 patch, I have tested V2 in scenario of
>>>> ticket #2233
>>>>      >>> and #1902, it also can fix the problem.
>>>>      >>> Here we have 2 solutions:
>>>>      >>> - The one I sent for review is letting the failed component
>>>> to be
>>>>      >>> instantiated, I think it is current behavior. But one change
>>>> is that
>>>>      >>> amfnd will not report su operational message to amfd until
>>>> amfnd
>>>>      >>> finishes removing the assignment of (faulty) su which
>>>> contains the
>>>>      >>> failed component
>>>>      >>> - The V2 patch postpones the instantiation of failed
>>>> component. amfnd
>>>>      >>> will instantiate the failed component (via
>>>> avnd_err_su_repair) after
>>>>      >>> amfnd finishes removing the assignment of faulty su.
>>>>      >>>
>>>>      >>> So basically the difference is the time that the failed
>>>> component should
>>>>      >>> be instantiated.
>>>>      >>>
>>>>      >>> Still in item 3.11.1.3.2:
>>>>      >>> "In a 2N or N+M redundancy model, SI2 also needs to be
>>>> switched over;
>>>>      >>> other-wise, the number of active service units would be
>>>> higher than what
>>>>      >>> is allowed by the redundancy model. However, in an Nway
>>>> redundancy
>>>>      >>> model, SI2 could be left assigned to SU1 (if the
>>>> saAmfSUFailover
>>>>      >>> configuration attribute of the ser-vice unit is set to
>>>> SA_FALSE), and a
>>>>      >>> repair of C2 should be attempted by reinstantiating it. If
>>>> the attempt
>>>>      >>> to instantiate C2 fails, the service unit becomes disabled,
>>>> and SI2 must
>>>>      >>> be switched-over; however, if the attempt to instantiate C2 is
>>>>      >>> successful, SI2 shall remain assigned to SU1, and based on
>>>> other
>>>>      >>> configuration parameters and N-way redundancy model
>>>> semantics, even
>>>>      >> SI1
>>>>      >>> might get reassigned to SU1."
>>>>      >>>
>>>>      >>> My comment on V2:
>>>>      >>>
>>>>      >>> The configuration in #2233 is different from the example in
>>>>      >>> specification, but it sounds to me the attempt to instantiate
>>>> failed
>>>>      >>> component should be done as soon as possible.
>>>>      >>> The check in V2 patch means the failed component won't be
>>>> instantiated
>>>>      >>> if its SU still has any assignment. It should be true to 2N
>>>> and N+M, but
>>>>      >>> it's not for other SG. (As the example in specification, S2
>>>> does not
>>>>      >>> have any CSI assigned to failed component C2).
>>>>      >> [Praveen]As of now we have documented in the PR doc
>>>> (conformance table
>>>>      >> section 3.11.1.3 Recovery) that if a component faults with
>>>> comp-failover
>>>>      >> recovery then AMFD switch-overs the whole SU for N-Way, N-Way
>>>> Active
>>>>      >> and
>>>>      >> N+M models also. This is just to highlight about other red
>>>> models. But
>>>>      >> this documentation is not clear for an unassigned comp.
>>>>      >> But from the beginning, comp-failover is working this way
>>>> only. At-least
>>>>      >> from clean up perspective we have fixed the problem of
>>>> parallelism in
>>>>      >> the past in the ticket #474.
>>>>      >>
>>>>      >> One more thing I have noted, proxy-proxied implementation is
>>>> based on
>>>>      >> B.01.01. As per B.01.01, proxy will register himself and its
>>>> proxied as
>>>>      >> soon as it gets instantiated. In a configuration containing
>>>> both proxy
>>>>      >> and proxied comp, if the proxy does not get any CSI and it
>>>> faults with
>>>>      >> comp-failover recovery then in instantiation phase it may
>>>> again register
>>>>      >> its proxy. I think proxy in other SU should register its
>>>> proxied. I
>>>>      >> guess, from deployment perspective such a configuration in
>>>> which a user
>>>>      >> configures proxy without any CSI may not exists and only
>>>> possibility is
>>>>      >> an application modeling a legacy code in NoRed model. However,
>>>> in the
>>>>      >> later version of spec B.01.02, proxied was supposed to mention
>>>> the name
>>>>      >> of proxy CSI and thus proxy should register only when its get
>>>> proxy CSI.
>>>>      >>
>>>>      >> One more point to be noted comp-failover can also be done as a
>>>> part of
>>>>      >> escalation also. If a component is instantiated before the
>>>> completion of
>>>>      >> comp-failover recovery and if faults again then it may
>>>> escalate to
>>>>      >> node-failover before completion of comp-failover recovery.
>>>>      >>
>>>>      >> Since in spec there is no specific discussion for
>>>> comp-failover recovery
>>>>      >> for an unassigned comp, I will encourage other maintainers
>>>> also to
>>>>      >> provide inputs.
>>>>      >>
>>>>      >>
>>>>      >> Thanks,
>>>>      >> Praveen
>>>>      >>
>>>>      >>
>>>>      >>    Moreover, in the clc.cc,
>>>>      >>> amfnd does not check any of si_list.n_nodes, this probably is
>>>> the logic
>>>>      >>> that has being done so far.
>>>>      >>>
>>>>      >>> Thanks,
>>>>      >>> Minh
>>>>      >>>
>>>>      >>> On 17/02/17 23:16, praveen malviya wrote:
>>>>      >>>> Hi Minh,
>>>>      >>>>
>>>>      >>>> I think we should see this problem from fault management
>>>> perspective
>>>>      >>>> also. Here repair of failed component is performed before the
>>>>      >>>> completion of recovery.In the problem, component faulted with
>>>>      >>>> comp-failover recovery and it was successfully
>>>> repaired(instantiated)
>>>>      >>>> when SU switch-over was still pending.
>>>>      >>>>
>>>>      >>>> Now the question is: Why it was never observed earlier? The
>>>> reason is
>>>>      >>>> generally all components are assigned at least one CSI. In
>>>> the present
>>>>      >>>> configuration failed component was not assigned any CSI.
>>>> When this
>>>>      >>>> component was cleaned up and marked UNINSTANTIATED, AMFND
>>>> sent
>>>>      >>>> comp-failover recovery request to AMFD. But after sending
>>>> recovery
>>>>      >>>> request, it instantiated failed comp when SU has still
>>>> assignments to
>>>>      >>>> be switch-overed. The code related to this assumes that comp
>>>> will have
>>>>      >>>> at-least one CSI assigned to it (clc.cc
>>>> avnd_comp_clc_st_chng_prc(),
>>>>      >>>> TERMINATING to UNINSTANTIATED if block). For normal
>>>> sequence of
>>>>      >>>> comp-failover, su is repaired after removal of assignment in
>>>>      >>>> avnd_su_si_oper_done() by calling avnd_err_su_repair().
>>>>      >>>>
>>>>      >>>> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery
>>>> Action page
>>>>      >>>> 195) about switch-overing all the SIs of failed SU in case of
>>>>      >>>> comp-failed recovery and not for other models. In current
>>>> OpenSAF
>>>>      >>>> implementation we are following this for all models.
>>>>      >>>>
>>>>      >>>> I think as a fix we should stop failed comp to get
>>>> instantiated before
>>>>      >>>> removal of assignments. For this the check in clc.cc can be
>>>> hardened
>>>>      >>>> to consider non-assigned comp failures.
>>>>      >>>> Attached is the patch (2233_v2.patch) based on this
>>>> idea/approach.
>>>>      >>>>
>>>>      >>>> Thanks,
>>>>      >>>> Praveen
>>>>      >>>>
>>>>      >>>>
>>>>      >>>> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote:
>>>>      >>>>> Hi Praveen,
>>>>      >>>>>
>>>>      >>>>> Yes, you are right, I will update the description.
>>>>      >>>>>
>>>>      >>>>> Thanks, Minh
>>>>      >>>>>
>>>>      >>>>> Quoting praveen malviya <praveen.malv...@oracle.com>:
>>>>      >>>>>
>>>>      >>>>>> Hi Minh,
>>>>      >>>>>>
>>>>      >>>>>> One quick question:
>>>>      >>>>>> Ticket description says:
>>>>      >>>>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1
>>>>      >> depends
>>>>      >>>>>> safSi=AmfDemoTwon"
>>>>      >>>>>> But logs are related to without SIdep. Also in the
>>>> configuration
>>>>      >>>>>> app3_twon3su3si.xml, SI dep classes are commented.
>>>>      >>>>>> I think ticket description needs correction as problem is
>>>> without SI
>>>>      >>>>>> dep.
>>>>      >>>>>> Please confirm.
>>>>      >>>>>>
>>>>      >>>>>> Thanks,
>>>>      >>>>>> Praveen
>>>>      >>>>>>
>>>>      >>>>>>
>>>>      >>>>>> On 17-Feb-17 10:58 AM, praveen malviya wrote:
>>>>      >>>>>>> Hi Minh,
>>>>      >>>>>>>
>>>>      >>>>>>> I have started reviewing this patch.
>>>>      >>>>>>>
>>>>      >>>>>>> Thanks,
>>>>      >>>>>>> Praveen
>>>>      >>>>>>>
>>>>      >>>>>>> On 15-Feb-17 9:22 AM, minh chau wrote:
>>>>      >>>>>>>> Hi all,
>>>>      >>>>>>>>
>>>>      >>>>>>>> Have you had time to review this patch?
>>>>      >>>>>>>> It changes the component failover sequence, so I think
>>>> we need
>>>>      >> more
>>>>      >>>>>>>> time
>>>>      >>>>>>>> to look at it.
>>>>      >>>>>>>>
>>>>      >>>>>>>> Thanks,
>>>>      >>>>>>>> Minh
>>>>      >>>>>>>>
>>>>      >>>>>>>> On 23/01/17 12:28, Minh Hon Chau wrote:
>>>>      >>>>>>>>> src/amf/amfnd/avnd_su.h |   1 +
>>>>      >>>>>>>>> src/amf/amfnd/clc.cc    |   3 ---
>>>>      >>>>>>>>> src/amf/amfnd/di.cc     |  12 +++++++++++-
>>>>      >>>>>>>>> src/amf/amfnd/susm.cc   |  32
>>>>      >> +++++++++++++++++++++++++++++---
>>>>      >>>>>>>>>   4 files changed, 41 insertions(+), 7 deletions(-)
>>>>      >>>>>>>>>
>>>>      >>>>>>>>>
>>>>      >>>>>>>>> In case component failover, faulty component will be
>>>> terminated.
>>>>      >>>>>>>>> When
>>>>      >>>>>>>>> the reinstantiation
>>>>      >>>>>>>>> is done, amfnd will send su_oper_message (enabled) to
>>>> amfd which
>>>>      >> is
>>>>      >>>>>>>>> running along with
>>>>      >>>>>>>>> component failover. In the reported problem, if
>>>> su_oper_message
>>>>      >>>>>>>>> (enabled) comes to amfd
>>>>      >>>>>>>>> before the quiesced assignment response (as part of
>>>> component
>>>>      >>>>>>>>> failover
>>>>      >>>>>>>>> sequence) comes to
>>>>      >>>>>>>>> amfd, then this quiesced assignment response is
>>>> ignored, thus
>>>>      >>>>>>>>> component failover will not
>>>>      >>>>>>>>> finish.
>>>>      >>>>>>>>>
>>>>      >>>>>>>>> The problem is in function susi_success_sg_realign with
>>>> act=5,
>>>>      >>>>>>>>> state=3, amfd always assumes
>>>>      >>>>>>>>> su having faulty component is OUT_OF_SERVICE. This
>>>> assumption is
>>>>      >>>>>>>>> true
>>>>      >>>>>>>>> in most of the time
>>>>      >>>>>>>>> when su_oper_message (enabled) comes a little later
>>>> than quiesced
>>>>      >>>>>>>>> assignment response. In fact
>>>>      >>>>>>>>> the su_oper_message (enabled) is not designed as part of
>>>>      >> component
>>>>      >>>>>>>>> failover sequence, thus it
>>>>      >>>>>>>>> can come any time during the failover. If amfd is
>>>> getting a bit
>>>>      >>>>>>>>> busier
>>>>      >>>>>>>>> with RTA update then
>>>>      >>>>>>>>> the faulty component has enough to reinstiantiate so
>>>> that amfnd
>>>>      >>>>>>>>> sends
>>>>      >>>>>>>>> su_oper_message (enabled)
>>>>      >>>>>>>>> before quiesced assignment response, the reported
>>>> problem will be
>>>>      >>>>>>>>> seen.
>>>>      >>>>>>>>>
>>>>      >>>>>>>>> This patch hardens the component failover sequence by
>>>> ensuring
>>>>      >> the
>>>>      >>>>>>>>> su_oper_message (enabled) to
>>>>      >>>>>>>>> be sent after su completes to remove assignment. This
>>>> approach
>>>>      >> comes
>>>>      >>>>>>>>> from the similarity in
>>>>      >>>>>>>>> su failover, where the su_oper_message (enabled) is
>>>> sent in repair
>>>>      >>>>>>>>> phase.
>>>>      >>>>>>>>>
>>>>      >>>>>>>>> diff --git a/src/amf/amfnd/avnd_su.h
>>>> b/src/amf/amfnd/avnd_su.h
>>>>      >>>>>>>>> --- a/src/amf/amfnd/avnd_su.h
>>>>      >>>>>>>>> +++ b/src/amf/amfnd/avnd_su.h
>>>>      >>>>>>>>> @@ -393,6 +393,7 @@ extern struct avnd_su_si_rec
>>>> *avnd_silis
>>>>      >>>>>>>>>   extern struct avnd_su_si_rec
>>>> *avnd_silist_getprev(const struct
>>>>      >>>>>>>>> avnd_su_si_rec *);
>>>>      >>>>>>>>>   extern struct avnd_su_si_rec
>>>> *avnd_silist_getlast(void);
>>>>      >>>>>>>>>   extern bool sufailover_in_progress(const AVND_SU *su);
>>>>      >>>>>>>>> +extern bool componentfailover_in_progress(const
>>>> AVND_SU *su);
>>>>      >>>>>>>>>   extern bool sufailover_during_nodeswitchover(const
>>>> AVND_SU
>>>>      >> *su);
>>>>      >>>>>>>>>   extern bool all_csis_in_removed_state(const AVND_SU
>>>> *su);
>>>>      >>>>>>>>>   extern void su_reset_restart_count_in_comps(const
>>>> struct
>>>>      >>>>>>>>> avnd_cb_tag
>>>>      >>>>>>>>> *cb, const AVND_SU *su);
>>>>      >>>>>>>>> diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
>>>>      >>>>>>>>> --- a/src/amf/amfnd/clc.cc
>>>>      >>>>>>>>> +++ b/src/amf/amfnd/clc.cc
>>>>      >>>>>>>>> @@ -2381,9 +2381,6 @@ uint32_t
>>>>      >> avnd_comp_clc_terming_cleansucc
>>>>      >>>>>>>>> (m_AVND_SU_IS_FAILOVER(su))) {
>>>>      >>>>>>>>>           /* yes, request director to orchestrate
>>>> component
>>>>      >>>>>>>>> failover */
>>>>      >>>>>>>>>           rc = avnd_di_oper_send(cb, su,
>>>>      >> SA_AMF_COMPONENT_FAILOVER);
>>>>      >>>>>>>>> -
>>>>      >>>>>>>>> -        //Reset component-failover here. SU failover
>>>> is reset as
>>>>      >>>>>>>>> part
>>>>      >>>>>>>>> of REPAIRED admin op.
>>>>      >>>>>>>>> - m_AVND_SU_FAILOVER_RESET(su);
>>>>      >>>>>>>>>       }
>>>>      >>>>>>>>>         /*
>>>>      >>>>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>>      >>>>>>>>> --- a/src/amf/amfnd/di.cc
>>>>      >>>>>>>>> +++ b/src/amf/amfnd/di.cc
>>>>      >>>>>>>>> @@ -894,7 +894,17 @@ uint32_t
>>>>      >> avnd_di_susi_resp_send(AVND_CB
>>>>      >>>>>>>>>               }
>>>>      >>>>>>>>> m_AVND_SU_ALL_SI_RESET(su);
>>>>      >>>>>>>>>           }
>>>>      >>>>>>>>> -
>>>>      >>>>>>>>> +        if (componentfailover_in_progress(su)) {
>>>>      >>>>>>>>> +            if (all_csis_in_removed_state(su) ==
>>>> true) {
>>>>      >>>>>>>>> + bool is_en;
>>>>      >>>>>>>>> + m_AVND_SU_IS_ENABLED(su, is_en);
>>>>      >>>>>>>>> +                if (is_en) {
>>>>      >>>>>>>>> + if (avnd_di_oper_send(cb, su, 0) ==
>>>>      >>>>>>>>> NCSCC_RC_SUCCESS) {
>>>>      >>>>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
>>>>      >>>>>>>>> + }
>>>>      >>>>>>>>> +                }
>>>>      >>>>>>>>> +            }
>>>>      >>>>>>>>> +        }
>>>>      >>>>>>>>>       /* free the contents of avnd message */
>>>>      >>>>>>>>> avnd_msg_content_free(cb, &msg);
>>>>      >>>>>>>>>   diff --git a/src/amf/amfnd/susm.cc
>>>> b/src/amf/amfnd/susm.cc
>>>>      >>>>>>>>> --- a/src/amf/amfnd/susm.cc
>>>>      >>>>>>>>> +++ b/src/amf/amfnd/susm.cc
>>>>      >>>>>>>>> @@ -1633,10 +1633,22 @@ uint32_t
>>>>      >> avnd_su_pres_st_chng_prc(AVND_C
>>>>      >>>>>>>>> m_AVND_SU_IS_ENABLED(su, is_en);
>>>>      >>>>>>>>>               if (true == is_en) {
>>>>      >>>>>>>>> TRACE("SU oper state is enabled");
>>>>      >>>>>>>>> +                // do not send su_oper state if
>>>> component
>>>>      >>>>>>>>> failover is
>>>>      >>>>>>>>> in progress
>>>>      >>>>>>>>> m_AVND_SU_OPER_STATE_SET(su,
>>>>      >>>>>>>>> SA_AMF_OPERATIONAL_ENABLED);
>>>>      >>>>>>>>> -                rc = avnd_di_oper_send(cb, su, 0);
>>>>      >>>>>>>>> -                if (NCSCC_RC_SUCCESS != rc)
>>>>      >>>>>>>>> - goto done;
>>>>      >>>>>>>>> +                if (componentfailover_in_progress(su)
>>>> == true) {
>>>>      >>>>>>>>> + si =
>>>> reinterpret_cast<AVND_SU_SI_REC*>
>>>>      >>>>>>>>> + (m_NCS_DBLIST_FIND_FIRST(&su->si_list));
>>>>      >>>>>>>>> + if (si == nullptr ||
>>>>      >>>>>>>>> all_csis_in_removed_state(su)) {
>>>>      >>>>>>>>> +                        rc = avnd_di_oper_send(cb, su,
>>>> 0);
>>>>      >>>>>>>>> +                        if (rc != NCSCC_RC_SUCCESS)
>>>>      >>>>>>>>> +                            goto done;
>>>>      >>>>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
>>>>      >>>>>>>>> + }
>>>>      >>>>>>>>> +                } else {
>>>>      >>>>>>>>> + rc = avnd_di_oper_send(cb, su, 0);
>>>>      >>>>>>>>> + if (NCSCC_RC_SUCCESS != rc)
>>>>      >>>>>>>>> +                        goto done;
>>>>      >>>>>>>>> +                }
>>>>      >>>>>>>>>               }
>>>>      >>>>>>>>>               else
>>>>      >>>>>>>>> TRACE("SU oper state is disabled");
>>>>      >>>>>>>>> @@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const
>>>>      >> AVND_S
>>>>      >>>>>>>>>   }
>>>>      >>>>>>>>>     /**
>>>>      >>>>>>>>> + * This function checks if the componentfailover is
>>>> going on.
>>>>      >>>>>>>>> + * @param su: ptr to the SU .
>>>>      >>>>>>>>> + *
>>>>      >>>>>>>>> + * @return true/false.
>>>>      >>>>>>>>> + */
>>>>      >>>>>>>>> +bool componentfailover_in_progress(const AVND_SU *su) {
>>>>      >>>>>>>>> +    if ((su->sufailover == false) &&
>>>> (!m_AVND_SU_IS_RESTART(su))
>>>>      >> &&
>>>>      >>>>>>>>> + (avnd_cb->oper_state !=
>>>>      >> SA_AMF_OPERATIONAL_DISABLED) &&
>>>>      >>>>>>>>> (!su->is_ncs) &&
>>>>      >>>>>>>>> + m_AVND_SU_IS_FAILOVER(su))
>>>>      >>>>>>>>> +        return true;
>>>>      >>>>>>>>> +    return false;
>>>>      >>>>>>>>> +}
>>>>      >>>>>>>>> +
>>>>      >>>>>>>>> +/**
>>>>      >>>>>>>>>    * This function checks if the sufailover and node
>>>> switchover are
>>>>      >>>>>>>>> going on.
>>>>      >>>>>>>>>    * @param su: ptr to the SU .
>>>>      >>>>>>>>>    *
>>>>      >>>>>>>>>
>>>>      >>>>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>>      >>>>>>>
>>>>      >>>>>>>
>>>>      >>>>>>> Check out the vibrant tech community on one of the
>>>> world's most
>>>>      >>>>>>> engaging tech sites, SlashDot.org!
>>>> http://sdm.link/slashdot
>>>>      >>>>>>> _______________________________________________
>>>>      >>>>>>> Opensaf-devel mailing list
>>>>      >>>>>>> Opensaf-devel@lists.sourceforge.net
>>>>      >>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>>      >>>>>>>
>>>>      >>>>>
>>>>
>>>>
>>>>
>>>
>>
>

