Hi Praveen,

Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233 
and #1902, it also can fix the problem.
Here we have 2 solutions:
- The one I sent for review is letting the failed component to be 
instantiated, I think it is current behavior. But one change is that 
amfnd will not report su operational message to amfd until amfnd 
finishes removing the assignment of (faulty) su which contains the 
failed component
- The V2 patch postpones the instantiation of failed component. amfnd 
will instantiate the failed component (via avnd_err_su_repair) after 
amfnd finishes removing the assignment of faulty su.

So basically the difference is the time that the failed component should 
be instantiated.

Still in item 3.11.1.3.2:
"In a 2N or N+M redundancy model, SI2 also needs to be switched over; 
other-wise, the number of active service units would be higher than what 
is allowed by the redundancy model. However, in an Nway redundancy 
model, SI2 could be left assigned to SU1 (if the saAmfSUFailover 
configuration attribute of the ser-vice unit is set to SA_FALSE), and a 
repair of C2 should be attempted by reinstantiating it. If the attempt 
to instantiate C2 fails, the service unit becomes disabled, and SI2 must 
be switched-over; however, if the attempt to instantiate C2 is 
successful, SI2 shall remain assigned to SU1, and based on other 
configuration parameters and N-way redundancy model semantics, even SI1 
might get reassigned to SU1."

My comment on V2:

The configuration in #2233 is different from the example in 
specification, but it sounds to me the attempt to instantiate failed 
component should be done as soon as possible.
The check in V2 patch means the failed component won't be instantiated 
if its SU still has any assignment. It should be true to 2N and N+M, but 
it's not for other SG. (As the example in specification, S2 does not 
have any CSI assigned to failed component C2). Moreover, in the clc.cc, 
amfnd does not check any of si_list.n_nodes, this probably is the logic 
that has being done so far.

Thanks,
Minh

On 17/02/17 23:16, praveen malviya wrote:
> Hi Minh,
>
> I think we should see this problem from fault management perspective 
> also. Here repair of failed component is performed before the 
> completion of recovery.In the problem, component faulted with 
> comp-failover recovery and it was successfully repaired(instantiated) 
> when SU switch-over was still pending.
>
> Now the question is: Why it was never observed earlier? The reason is 
> generally all components are assigned at least one CSI. In the present 
> configuration failed component was not assigned any CSI. When this 
> component was cleaned up and marked UNINSTANTIATED, AMFND sent 
> comp-failover recovery request to AMFD. But after sending recovery 
> request, it instantiated failed comp when SU has still assignments to 
> be switch-overed. The code related to this assumes that comp will have 
> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(), 
> TERMINATING to UNINSTANTIATED if block). For normal sequence of 
> comp-failover, su is repaired after removal of assignment in 
> avnd_su_si_oper_done() by calling avnd_err_su_repair().
>
> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page 
> 195) about switch-overing all the SIs of failed SU in case of 
> comp-failed recovery and not for other models. In current OpenSAF 
> implementation we are following this for all models.
>
> I think as a fix we should stop failed comp to get instantiated before 
> removal of assignments. For this the check in clc.cc can be hardened 
> to consider non-assigned comp failures.
> Attached is the patch (2233_v2.patch) based on this idea/approach.
>
> Thanks,
> Praveen
>
>
> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote:
>> Hi Praveen,
>>
>> Yes, you are right, I will update the description.
>>
>> Thanks, Minh
>>
>> Quoting praveen malviya <praveen.malv...@oracle.com>:
>>
>>> Hi Minh,
>>>
>>> One quick question:
>>> Ticket description says:
>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 depends
>>> safSi=AmfDemoTwon"
>>> But logs are related to without SIdep. Also in the configuration
>>> app3_twon3su3si.xml, SI dep classes are commented.
>>> I think ticket description needs correction as problem is without SI 
>>> dep.
>>> Please confirm.
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>> On 17-Feb-17 10:58 AM, praveen malviya wrote:
>>>> Hi Minh,
>>>>
>>>> I have started reviewing this patch.
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>> On 15-Feb-17 9:22 AM, minh chau wrote:
>>>>> Hi all,
>>>>>
>>>>> Have you had time to review this patch?
>>>>> It changes the component failover sequence, so I think we need more
>>>>> time
>>>>> to look at it.
>>>>>
>>>>> Thanks,
>>>>> Minh
>>>>>
>>>>> On 23/01/17 12:28, Minh Hon Chau wrote:
>>>>>>  src/amf/amfnd/avnd_su.h |   1 +
>>>>>>  src/amf/amfnd/clc.cc    |   3 ---
>>>>>>  src/amf/amfnd/di.cc     |  12 +++++++++++-
>>>>>>  src/amf/amfnd/susm.cc   |  32 +++++++++++++++++++++++++++++---
>>>>>>  4 files changed, 41 insertions(+), 7 deletions(-)
>>>>>>
>>>>>>
>>>>>> In case component failover, faulty component will be terminated. 
>>>>>> When
>>>>>> the reinstantiation
>>>>>> is done, amfnd will send su_oper_message (enabled) to amfd which is
>>>>>> running along with
>>>>>> component failover. In the reported problem, if su_oper_message
>>>>>> (enabled) comes to amfd
>>>>>> before the quiesced assignment response (as part of component 
>>>>>> failover
>>>>>> sequence) comes to
>>>>>> amfd, then this quiesced assignment response is ignored, thus
>>>>>> component failover will not
>>>>>> finish.
>>>>>>
>>>>>> The problem is in function susi_success_sg_realign with act=5,
>>>>>> state=3, amfd always assumes
>>>>>> su having faulty component is OUT_OF_SERVICE. This assumption is 
>>>>>> true
>>>>>> in most of the time
>>>>>> when su_oper_message (enabled) comes a little later than quiesced
>>>>>> assignment response. In fact
>>>>>> the su_oper_message (enabled) is not designed as part of component
>>>>>> failover sequence, thus it
>>>>>> can come any time during the failover. If amfd is getting a bit 
>>>>>> busier
>>>>>> with RTA update then
>>>>>> the faulty component has enough to reinstiantiate so that amfnd 
>>>>>> sends
>>>>>> su_oper_message (enabled)
>>>>>> before quiesced assignment response, the reported problem will be
>>>>>> seen.
>>>>>>
>>>>>> This patch hardens the component failover sequence by ensuring the
>>>>>> su_oper_message (enabled) to
>>>>>> be sent after su completes to remove assignment. This approach comes
>>>>>> from the similarity in
>>>>>> su failover, where the su_oper_message (enabled) is sent in repair
>>>>>> phase.
>>>>>>
>>>>>> diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h
>>>>>> --- a/src/amf/amfnd/avnd_su.h
>>>>>> +++ b/src/amf/amfnd/avnd_su.h
>>>>>> @@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis
>>>>>>  extern struct avnd_su_si_rec *avnd_silist_getprev(const struct
>>>>>> avnd_su_si_rec *);
>>>>>>  extern struct avnd_su_si_rec *avnd_silist_getlast(void);
>>>>>>  extern bool sufailover_in_progress(const AVND_SU *su);
>>>>>> +extern bool componentfailover_in_progress(const AVND_SU *su);
>>>>>>  extern bool sufailover_during_nodeswitchover(const AVND_SU *su);
>>>>>>  extern bool all_csis_in_removed_state(const AVND_SU *su);
>>>>>>  extern void su_reset_restart_count_in_comps(const struct 
>>>>>> avnd_cb_tag
>>>>>> *cb, const AVND_SU *su);
>>>>>> diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
>>>>>> --- a/src/amf/amfnd/clc.cc
>>>>>> +++ b/src/amf/amfnd/clc.cc
>>>>>> @@ -2381,9 +2381,6 @@ uint32_t avnd_comp_clc_terming_cleansucc
>>>>>>              (m_AVND_SU_IS_FAILOVER(su))) {
>>>>>>          /* yes, request director to orchestrate component 
>>>>>> failover */
>>>>>>          rc = avnd_di_oper_send(cb, su, SA_AMF_COMPONENT_FAILOVER);
>>>>>> -
>>>>>> -        //Reset component-failover here. SU failover is reset as 
>>>>>> part
>>>>>> of REPAIRED admin op.
>>>>>> -        m_AVND_SU_FAILOVER_RESET(su);
>>>>>>      }
>>>>>>        /*
>>>>>> diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
>>>>>> --- a/src/amf/amfnd/di.cc
>>>>>> +++ b/src/amf/amfnd/di.cc
>>>>>> @@ -894,7 +894,17 @@ uint32_t avnd_di_susi_resp_send(AVND_CB
>>>>>>              }
>>>>>>              m_AVND_SU_ALL_SI_RESET(su);
>>>>>>          }
>>>>>> -
>>>>>> +        if (componentfailover_in_progress(su)) {
>>>>>> +            if (all_csis_in_removed_state(su) == true) {
>>>>>> +                bool is_en;
>>>>>> +                m_AVND_SU_IS_ENABLED(su, is_en);
>>>>>> +                if (is_en) {
>>>>>> +                    if (avnd_di_oper_send(cb, su, 0) ==
>>>>>> NCSCC_RC_SUCCESS) {
>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
>>>>>> +                    }
>>>>>> +                }
>>>>>> +            }
>>>>>> +        }
>>>>>>      /* free the contents of avnd message */
>>>>>>      avnd_msg_content_free(cb, &msg);
>>>>>>  diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc
>>>>>> --- a/src/amf/amfnd/susm.cc
>>>>>> +++ b/src/amf/amfnd/susm.cc
>>>>>> @@ -1633,10 +1633,22 @@ uint32_t avnd_su_pres_st_chng_prc(AVND_C
>>>>>>              m_AVND_SU_IS_ENABLED(su, is_en);
>>>>>>              if (true == is_en) {
>>>>>>                  TRACE("SU oper state is enabled");
>>>>>> +                // do not send su_oper state if component 
>>>>>> failover is
>>>>>> in progress
>>>>>>                  m_AVND_SU_OPER_STATE_SET(su,
>>>>>> SA_AMF_OPERATIONAL_ENABLED);
>>>>>> -                rc = avnd_di_oper_send(cb, su, 0);
>>>>>> -                if (NCSCC_RC_SUCCESS != rc)
>>>>>> -                    goto done;
>>>>>> +                if (componentfailover_in_progress(su) == true) {
>>>>>> +                    si = reinterpret_cast<AVND_SU_SI_REC*>
>>>>>> + (m_NCS_DBLIST_FIND_FIRST(&su->si_list));
>>>>>> +                    if (si == nullptr ||
>>>>>> all_csis_in_removed_state(su)) {
>>>>>> +                        rc = avnd_di_oper_send(cb, su, 0);
>>>>>> +                        if (rc != NCSCC_RC_SUCCESS)
>>>>>> +                            goto done;
>>>>>> +                        m_AVND_SU_FAILOVER_RESET(su);
>>>>>> +                    }
>>>>>> +                } else {
>>>>>> +                    rc = avnd_di_oper_send(cb, su, 0);
>>>>>> +                    if (NCSCC_RC_SUCCESS != rc)
>>>>>> +                        goto done;
>>>>>> +                }
>>>>>>              }
>>>>>>              else
>>>>>>                  TRACE("SU oper state is disabled");
>>>>>> @@ -3551,6 +3563,20 @@ bool sufailover_in_progress(const AVND_S
>>>>>>  }
>>>>>>    /**
>>>>>> + * This function checks if the componentfailover is going on.
>>>>>> + * @param su: ptr to the SU .
>>>>>> + *
>>>>>> + * @return true/false.
>>>>>> + */
>>>>>> +bool componentfailover_in_progress(const AVND_SU *su) {
>>>>>> +    if ((su->sufailover == false) && (!m_AVND_SU_IS_RESTART(su)) &&
>>>>>> +            (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) &&
>>>>>> (!su->is_ncs) &&
>>>>>> +            m_AVND_SU_IS_FAILOVER(su))
>>>>>> +        return true;
>>>>>> +    return false;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>>   * This function checks if the sufailover and node switchover are
>>>>>> going on.
>>>>>>   * @param su: ptr to the SU .
>>>>>>   *
>>>>>>
>>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>  
>>>>
>>>>
>>>> Check out the vibrant tech community on one of the world's most
>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>>>> _______________________________________________
>>>> Opensaf-devel mailing list
>>>> Opensaf-devel@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>>
>>
>>


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to