[devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-01-22 Thread Minh Hon Chau
 src/amf/amfnd/avnd_su.h |   1 +
 src/amf/amfnd/clc.cc|   3 ---
 src/amf/amfnd/di.cc |  12 +++-
 src/amf/amfnd/susm.cc   |  32 +---
 4 files changed, 41 insertions(+), 7 deletions(-)


During component failover, the faulty component is terminated and then
reinstantiated. When the reinstantiation is done, amfnd sends
su_oper_message (enabled) to amfd while the component failover is still
running. In the reported problem, su_oper_message (enabled) reaches amfd
before the quiesced assignment response (part of the component failover
sequence). amfd then ignores the quiesced assignment response, so the
component failover never finishes.

The problem is in function susi_success_sg_realign with act=5, state=3: amfd
always assumes that the su containing the faulty component is OUT_OF_SERVICE.
This assumption holds most of the time, when su_oper_message (enabled) arrives
a little later than the quiesced assignment response. However,
su_oper_message (enabled) is not designed as part of the component failover
sequence, so it can arrive at any time during the failover. If amfd is a bit
busier with RTA updates, the faulty component has enough time to reinstantiate,
amfnd sends su_oper_message (enabled) before the quiesced assignment response,
and the reported problem is seen.

This patch hardens the component failover sequence by ensuring that
su_oper_message (enabled) is sent only after the su has finished removing its
assignments. This approach mirrors su failover, where su_oper_message
(enabled) is sent in the repair phase.
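
A minimal sketch of the ordering rule described above, assuming a much
simplified model of an su on the amfnd side. The types and functions here
(SuModel, Outbox, maybe_send_oper_enabled, on_component_reinstantiated,
on_assignment_removed) are hypothetical and are not OpenSAF identifiers; the
actual change is in the diff itself.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model of a service unit as seen by the node
// director. None of these names exist in OpenSAF.
struct SuModel {
  bool comp_failover_in_progress = false;  // comp-failover recovery requested
  bool oper_enabled = false;               // faulty component is back up
  int assignments_pending_removal = 0;     // SIs not yet removed from the su
};

// Messages sent towards the director, in sending order.
using Outbox = std::vector<std::string>;

// Report "enabled" only when no component failover is running, or when the
// running failover has already removed every assignment of the su.
inline void maybe_send_oper_enabled(SuModel& su, Outbox& out) {
  if (!su.oper_enabled) return;
  if (su.comp_failover_in_progress && su.assignments_pending_removal > 0)
    return;  // hold the message back until removal is complete
  out.push_back("su_oper_enabled");
  su.comp_failover_in_progress = false;  // failover sequence is finished
}

// The faulty component has been reinstantiated.
inline void on_component_reinstantiated(SuModel& su, Outbox& out) {
  su.oper_enabled = true;
  maybe_send_oper_enabled(su, out);
}

// One assignment has been removed: the response goes out first, then the
// held-back operational message if this was the last assignment.
inline void on_assignment_removed(SuModel& su, Outbox& out) {
  assert(su.assignments_pending_removal > 0);
  --su.assignments_pending_removal;
  out.push_back("susi_resp");
  maybe_send_oper_enabled(su, out);
}
```

In this model, a reinstantiation that completes while one assignment is still
pending produces no message; the later removal then emits the assignment
response followed by the operational message, which is the order amfd expects.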

diff --git a/src/amf/amfnd/avnd_su.h b/src/amf/amfnd/avnd_su.h
--- a/src/amf/amfnd/avnd_su.h
+++ b/src/amf/amfnd/avnd_su.h
@@ -393,6 +393,7 @@ extern struct avnd_su_si_rec *avnd_silis
 extern struct avnd_su_si_rec *avnd_silist_getprev(const struct avnd_su_si_rec *);
 extern struct avnd_su_si_rec *avnd_silist_getlast(void);
 extern bool sufailover_in_progress(const AVND_SU *su);
+extern bool componentfailover_in_progress(const AVND_SU *su);
 extern bool sufailover_during_nodeswitchover(const AVND_SU *su);
 extern bool all_csis_in_removed_state(const AVND_SU *su);
 extern void su_reset_restart_count_in_comps(const struct avnd_cb_tag *cb, const AVND_SU *su);
diff --git a/src/amf/amfnd/clc.cc b/src/amf/amfnd/clc.cc
--- a/src/amf/amfnd/clc.cc
+++ b/src/amf/amfnd/clc.cc
@@ -2381,9 +2381,6 @@ uint32_t avnd_comp_clc_terming_cleansucc
(m_AVND_SU_IS_FAILOVER(su))) {
/* yes, request director to orchestrate component failover */
rc = avnd_di_oper_send(cb, su, SA_AMF_COMPONENT_FAILOVER);
-
-   //Reset component-failover here. SU failover is reset as part of REPAIRED admin op.
-   m_AVND_SU_FAILOVER_RESET(su);
}
 
/*
diff --git a/src/amf/amfnd/di.cc b/src/amf/amfnd/di.cc
--- a/src/amf/amfnd/di.cc
+++ b/src/amf/amfnd/di.cc
@@ -894,7 +894,17 @@ uint32_t avnd_di_susi_resp_send(AVND_CB 
}
m_AVND_SU_ALL_SI_RESET(su);
 }
-
+if (componentfailover_in_progress(su)) {
+   if (all_csis_in_removed_state(su) == true) {
+   bool is_en;
+   m_AVND_SU_IS_ENABLED(su, is_en);
+   if (is_en) {
+   if (avnd_di_oper_send(cb, su, 0) == NCSCC_RC_SUCCESS) {
+   m_AVND_SU_FAILOVER_RESET(su);
+   }
+   }
+   }
+}
/* free the contents of avnd message */
avnd_msg_content_free(cb, &msg);
 
diff --git a/src/amf/amfnd/susm.cc b/src/amf/amfnd/susm.cc
--- a/src/amf/amfnd/susm.cc
+++ b/src/amf/amfnd/susm.cc
@@ -1633,10 +1633,22 @@ uint32_t avnd_su_pres_st_chng_prc(AVND_C
m_AVND_SU_IS_ENABLED(su, is_en);
if (true == is_en) {
TRACE("SU oper state is enabled");
+   // do not send su_oper state if component failover is in progress
m_AVND_SU_OPER_STATE_SET(su, SA_AMF_OPERATIONAL_ENABLED);
-   rc = avnd_di_oper_send(cb, su, 0);
-   if (NCSCC_RC_SUCCESS != rc)
-   goto done;
+   if (componentfailover_in_progress(su) == true) {
+   si = reinterpret_cast
+   
(m_NCS_DBLIST_FIND_FIRST(&su->si_list));
+   if (si == nullptr || all_csis_in_removed_state(su)) {
+   rc = avnd_di_oper_send(cb, su, 0);
+   if (rc != NCSCC_RC_SUCCESS)
+   goto done;
+   m_AVND_SU_FAILOVER_RESET(su);
+ 

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-14 Thread minh chau
Hi all,

Have you had time to review this patch?
It changes the component failover sequence, so I think we need more time 
to look at it.

Thanks,
Minh


Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-16 Thread praveen malviya
Hi Minh,

I have started reviewing this patch.

Thanks,
Praveen


Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-16 Thread praveen malviya
Hi Minh,

One quick question:
The ticket description says:
"Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 depends safSi=AmfDemoTwon"
But the logs come from a run without SI dependencies. Also, in the
configuration app3_twon3su3si.xml, the SI dependency classes are commented out.
I think the ticket description needs correction, as the problem occurs without
SI dependencies.
Please confirm.

Thanks,
Praveen



Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-16 Thread Minh Hon CHAU
Hi Praveen,

Yes, you are right, I will update the description.

Thanks, Minh


Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-17 Thread praveen malviya

Hi Minh,

I think we should also look at this problem from a fault management
perspective. Here the repair of the failed component is performed before
recovery completes. In the reported problem, the component faulted with
comp-failover recovery and was successfully repaired (instantiated) while the
SU switch-over was still pending.


Now the question is: why was this never observed earlier? The reason is that,
in general, every component is assigned at least one CSI. In the present
configuration the failed component was not assigned any CSI. When this
component was cleaned up and marked UNINSTANTIATED, AMFND sent a comp-failover
recovery request to AMFD. But after sending the recovery request, it
instantiated the failed comp while the SU still had assignments to be switched
over. The related code assumes that the comp has at least one CSI assigned to
it (clc.cc, avnd_comp_clc_st_chng_prc(), TERMINATING to UNINSTANTIATED if
block). In the normal comp-failover sequence, the su is repaired after removal
of assignments, in avnd_su_si_oper_done(), by calling avnd_err_su_repair().


For 2N and N+M, the spec (3.11.1.3.2 Fail-Over Recovery Action, page 195)
talks about switching over all the SIs of the failed SU in case of
comp-failover recovery, and not for other models. The current OpenSAF
implementation follows this for all models.


I think as a fix we should prevent the failed comp from being instantiated
before the assignments are removed. For this, the check in clc.cc can be
hardened to consider failures of comps that have no assignments.

Attached is the patch (2233_v2.patch) based on this approach.
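
As a rough illustration of the idea only (this is not the content of
2233_v2.patch), the two checks can be contrasted as below. The struct and
function names are hypothetical, and the "comp only" check is just my reading
of the assumption described above, not a copy of the clc.cc code.

```cpp
// Hypothetical model; names are illustrative and not OpenSAF identifiers.
struct FailedCompContext {
  int csis_assigned_to_comp;  // CSIs assigned to the failed component itself
  int sis_assigned_to_su;     // SIs still assigned to the enclosing SU
};

// Check that looks only at the failed component: a comp without any CSI is
// considered ready for repair right after cleanup.
inline bool repair_now_comp_only(const FailedCompContext& c) {
  return c.csis_assigned_to_comp == 0;
}

// Hardened check: additionally wait until the SU holds no assignments, so
// that repair cannot overtake the removal of assignments.
inline bool repair_now_hardened(const FailedCompContext& c) {
  return c.csis_assigned_to_comp == 0 && c.sis_assigned_to_su == 0;
}
```

For a failed comp with no CSI in an SU that still holds three SIs, the first
check allows immediate instantiation (the situation in the ticket), while the
hardened check defers it until the SU has no assignments left.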

Thanks,
Praveen



Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-19 Thread minh chau
Hi Praveen,

Thanks for your V2 patch. I have tested V2 in the scenarios of tickets #2233
and #1902, and it also fixes the problem.
Here we have 2 solutions:
- The one I sent for review lets the failed component be instantiated, which
I think is the current behavior. The one change is that amfnd will not report
the su operational message to amfd until amfnd finishes removing the
assignment of the (faulty) su which contains the failed component.
- The V2 patch postpones the instantiation of the failed component. amfnd
will instantiate the failed component (via avnd_err_su_repair) after amfnd
finishes removing the assignment of the faulty su.

So basically the difference is the time at which the failed component is
instantiated.

Still in item 3.11.1.3.2:
"In a 2N or N+M redundancy model, SI2 also needs to be switched over;
otherwise, the number of active service units would be higher than what
is allowed by the redundancy model. However, in an N-way redundancy
model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
configuration attribute of the service unit is set to SA_FALSE), and a
repair of C2 should be attempted by reinstantiating it. If the attempt
to instantiate C2 fails, the service unit becomes disabled, and SI2 must
be switched-over; however, if the attempt to instantiate C2 is
successful, SI2 shall remain assigned to SU1, and based on other
configuration parameters and N-way redundancy model semantics, even SI1
might get reassigned to SU1."

My comment on V2:

The configuration in #2233 is different from the example in the
specification, but it sounds to me that the attempt to instantiate the failed
component should be made as soon as possible.
The check in the V2 patch means the failed component won't be instantiated
if its SU still has any assignment. That is correct for 2N and N+M, but not
for other SGs. (As in the example in the specification, SI2 does not have any
CSI assigned to the failed component C2.) Moreover, in clc.cc, amfnd does not
check si_list.n_nodes at all; this is probably the logic that has been in
place so far.
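
To make the concern concrete, here is a hypothetical sketch of a deferral
rule that depends on the redundancy model. It only encodes my reading of the
quoted passage (full switch-over is described for 2N and N+M); the enum and
both functions are invented for illustration and do not reflect current
OpenSAF behavior, which applies the switch-over to all models.

```cpp
// Hypothetical names; not OpenSAF identifiers.
enum class RedundancyModel { TwoN, NPlusM, NWay, NWayActive, NoRedundancy };

// The quoted spec passage describes switching over all SIs of the failed SU
// only for 2N and N+M.
constexpr bool full_switchover_described(RedundancyModel m) {
  return m == RedundancyModel::TwoN || m == RedundancyModel::NPlusM;
}

// Defer the repair (reinstantiation) of the failed component only when the
// SU still has assignments and the model calls for a full switch-over.
constexpr bool defer_repair(RedundancyModel m, bool su_has_assignments) {
  return su_has_assignments && full_switchover_described(m);
}
```

With such a rule, an N-way SU that keeps SI2 assigned would still attempt the
reinstantiation immediately, while a 2N SU would wait for its assignments to
be removed.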

Thanks,
Minh

On 17/02/17 23:16, praveen malviya wrote:
> Hi Minh,
>
> I think we should see this problem from fault management perspective 
> also. Here repair of failed component is performed before the 
> completion of recovery.In the problem, component faulted with 
> comp-failover recovery and it was successfully repaired(instantiated) 
> when SU switch-over was still pending.
>
> Now the question is: Why it was never observed earlier? The reason is 
> generally all components are assigned at least one CSI. In the present 
> configuration failed component was not assigned any CSI. When this 
> component was cleaned up and marked UNINSTANTIATED, AMFND sent 
> comp-failover recovery request to AMFD. But after sending recovery 
> request, it instantiated failed comp when SU has still assignments to 
> be switch-overed. The code related to this assumes that comp will have 
> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(), 
> TERMINATING to UNINSTANTIATED if block). For normal sequence of 
> comp-failover, su is repaired after removal of assignment in 
> avnd_su_si_oper_done() by calling avnd_err_su_repair().
>
> For 2N and N+M spec talks (3.11.1.3.2 Fail-Over Recovery Action page 
> 195) about switch-overing all the SIs of failed SU in case of 
> comp-failed recovery and not for other models. In current OpenSAF 
> implementation we are following this for all models.
>
> I think as a fix we should stop failed comp to get instantiated before 
> removal of assignments. For this the check in clc.cc can be hardened 
> to consider non-assigned comp failures.
> Attached is the patch (2233_v2.patch) based on this idea/approach.
>
> Thanks,
> Praveen
>
>
> On 17-Feb-17 1:19 PM, Minh Hon CHAU wrote:
>> Hi Praveen,
>>
>> Yes, you are right, I will update the description.
>>
>> Thanks, Minh
>>
>> Quoting praveen malviya :
>>
>>> Hi Minh,
>>>
>>> One quick question:
>>> Ticket description says:
>>> "Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 depends
>>> safSi=AmfDemoTwon"
>>> But logs are related to without SIdep. Also in the configuration
>>> app3_twon3su3si.xml, SI dep classes are commented.
>>> I think ticket description needs correction as problem is without SI 
>>> dep.
>>> Please confirm.
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>> On 17-Feb-17 10:58 AM, praveen malviya wrote:
>>>> Hi Minh,
>>>>
>>>> I have started reviewing this patch.
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>> On 15-Feb-17 9:22 AM, minh chau wrote:
>>>>> Hi all,
>>>>>
>>>>> Have you had time to review this patch?
>>>>> It changes the component failover sequence, so I think we need more
>>>>> time
>>>>> to look at it.
>>>>>
>>>>> Thanks,
>>>>> Minh
>>>>>
>>>>> On 23/01/17 12:28, Minh Hon Chau wrote:
>>>>>>  src/amf/amfnd/avnd_su.h |   1 +
>>>>>>  src/amf/amfnd/clc.cc|   3 ---
>>>>>>  src/amf/amfnd/di.cc |  12 +++-
>>>>>>  src/amf/amfnd/susm.cc   |  32 +

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-20 Thread praveen malviya
Hi Minh,

Please find my response inline with [Praveen].

Thanks,
Praveen

On 20-Feb-17 6:58 AM, minh chau wrote:
> Hi Praveen,
>
> Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233
> and #1902, it also can fix the problem.
> Here we have 2 solutions:
> - The one I sent for review is letting the failed component to be
> instantiated, I think it is current behavior. But one change is that
> amfnd will not report su operational message to amfd until amfnd
> finishes removing the assignment of (faulty) su which contains the
> failed component
> - The V2 patch postpones the instantiation of failed component. amfnd
> will instantiate the failed component (via avnd_err_su_repair) after
> amfnd finishes removing the assignment of faulty su.
>
> So basically the difference is the time that the failed component should
> be instantiated.
>
> Still in item 3.11.1.3.2:
> "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
> otherwise, the number of active service units would be higher than what
> is allowed by the redundancy model. However, in an Nway redundancy
> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
> configuration attribute of the service unit is set to SA_FALSE), and a
> repair of C2 should be attempted by reinstantiating it. If the attempt
> to instantiate C2 fails, the service unit becomes disabled, and SI2 must
> be switched-over; however, if the attempt to instantiate C2 is
> successful, SI2 shall remain assigned to SU1, and based on other
> configuration parameters and N-way redundancy model semantics, even SI1
> might get reassigned to SU1."
>
> My comment on V2:
>
> The configuration in #2233 is different from the example in
> specification, but it sounds to me the attempt to instantiate failed
> component should be done as soon as possible.
> The check in V2 patch means the failed component won't be instantiated
> if its SU still has any assignment. It should be true to 2N and N+M, but
> it's not for other SG. (As the example in specification, S2 does not
> have any CSI assigned to failed component C2).
[Praveen] As of now we have documented in the PR doc (conformance table,
section 3.11.1.3 Recovery) that if a component faults with comp-failover
recovery, then AMFD switches over the whole SU for the N-Way, N-Way Active
and N+M models as well. This is just to highlight the other redundancy
models. But this documentation is not clear about an unassigned comp.
However, comp-failover has worked this way from the beginning. At least
from the cleanup perspective, we fixed the problem of parallelism in the
past in ticket #474.

One more thing I have noted: the proxy-proxied implementation is based on
B.01.01. As per B.01.01, a proxy registers itself and its proxied as soon
as it gets instantiated. In a configuration containing both a proxy and a
proxied comp, if the proxy does not get any CSI and it faults with
comp-failover recovery, then in the instantiation phase it may again
register its proxy. I think the proxy in the other SU should register its
proxied. I guess that, from a deployment perspective, a configuration in
which a user configures a proxy without any CSI may not exist, and the only
possibility is an application modeling legacy code in the NoRed model.
However, in the later version of the spec, B.01.02, the proxied was supposed
to mention the name of the proxy CSI, and thus the proxy should register
only when it gets the proxy CSI.

One more point to note: comp-failover can also be done as part of
escalation. If a component is instantiated before the completion of
comp-failover recovery and it faults again, then it may escalate to
node-failover before the completion of comp-failover recovery.

Since the spec has no specific discussion of comp-failover recovery for an
unassigned comp, I encourage the other maintainers to provide inputs as
well.


Thanks,
Praveen


> Moreover, in clc.cc,
> amfnd does not check any of si_list.n_nodes; this is probably the logic
> that has been done so far.
>
> Thanks,
> Minh
>
> On 17/02/17 23:16, praveen malviya wrote:
>> Hi Minh,
>>
>> I think we should see this problem from fault management perspective
>> also. Here repair of failed component is performed before the
>> completion of recovery. In the problem, component faulted with
>> comp-failover recovery and it was successfully repaired (instantiated)
>> when SU switch-over was still pending.
>>
>> Now the question is: Why it was never observed earlier? The reason is
>> generally all components are assigned at least one CSI. In the present
>> configuration failed component was not assigned any CSI. When this
>> component was cleaned up and marked UNINSTANTIATED, AMFND sent
>> comp-failover recovery request to AMFD. But after sending recovery
>> request, it instantiated failed comp when SU has still assignments to
>> be switch-overed. The code related to this assumes that comp will have
>> at-least one CSI assigned to it (clc.cc avnd_comp_clc_st_chng_prc(),
>> TERMINATING to U

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-21 Thread minh chau
Hi Praveen,

Please find my response with [Minh2]

Thanks,
Minh

On 21/02/17 16:45, praveen malviya wrote:
> Hi Minh,
>
> Please find my response inline with [Praveen].
>
> Thanks,
> Praveen
>
> On 20-Feb-17 6:58 AM, minh chau wrote:
>> Hi Praveen,
>>
>> Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233
>> and #1902, it also can fix the problem.
>> Here we have 2 solutions:
>> - The one I sent for review is letting the failed component to be
>> instantiated, I think it is current behavior. But one change is that
>> amfnd will not report su operational message to amfd until amfnd
>> finishes removing the assignment of (faulty) su which contains the
>> failed component
>> - The V2 patch postpones the instantiation of failed component. amfnd
>> will instantiate the failed component (via avnd_err_su_repair) after
>> amfnd finishes removing the assignment of faulty su.
>>
>> So basically the difference is the time that the failed component should
>> be instantiated.
>>
>> Still in item 3.11.1.3.2:
>> "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
>> otherwise, the number of active service units would be higher than what
>> is allowed by the redundancy model. However, in an Nway redundancy
>> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
>> configuration attribute of the service unit is set to SA_FALSE), and a
>> repair of C2 should be attempted by reinstantiating it. If the attempt
>> to instantiate C2 fails, the service unit becomes disabled, and SI2 must
>> be switched-over; however, if the attempt to instantiate C2 is
>> successful, SI2 shall remain assigned to SU1, and based on other
>> configuration parameters and N-way redundancy model semantics, even SI1
>> might get reassigned to SU1."
>>
>> My comment on V2:
>>
>> The configuration in #2233 is different from the example in
>> specification, but it sounds to me the attempt to instantiate failed
>> component should be done as soon as possible.
>> The check in V2 patch means the failed component won't be instantiated
>> if its SU still has any assignment. It should be true to 2N and N+M, but
>> it's not for other SG. (As the example in specification, S2 does not
>> have any CSI assigned to failed component C2).
> [Praveen]As of now we have documented in the PR doc (conformance table 
> section 3.11.1.3 Recovery) that if a component faults with 
> comp-failover recovery then AMFD switch-overs the whole SU for N-Way, 
> N-Way Active and N+M models also. This is just to highlight about 
> other red models. But this documentation is not clear for an 
> unassigned comp.
> But from the beginning, comp-failover is working this way only. 
> At-least from clean up perspective we have fixed the problem of 
> parallelism in the past in the ticket #474.
>
> One more thing I have noted, proxy-proxied implementation is based on 
> B.01.01. As per B.01.01, proxy will register himself and its proxied 
> as soon as it gets instantiated. In a configuration containing both 
> proxy and proxied comp, if the proxy does not get any CSI and it 
> faults with comp-failover recovery then in instantiation phase it may 
> again register its proxy. I think proxy in other SU should register 
> its proxied. I guess, from deployment perspective such a configuration 
> in which a user configures proxy without any CSI may not exists and 
> only possibility is an application modeling a legacy code in NoRed 
> model. However, in the later version of spec B.01.02, proxied was 
> supposed to mention the name of proxy CSI and thus proxy should 
> register only when its get proxy CSI.
>
> One more point to be noted comp-failover can also be done as a part of 
> escalation also. If a component is instantiated before the completion 
> of comp-failover recovery and if faults again then it may escalate to 
> node-failover before completion of comp-failover recovery.
>
> Since in spec there is no specific discussion for comp-failover 
> recovery for an unassigned comp, I will encourage other maintainers 
> also to provide inputs.
[Minh2] Yes, this was my worry too: that I could break something that has
been working for a long time in this area. If you look at component
failover in the successful case, the su operational message is always
reported to amfd at the time amfnd completes removing the assignment of the
su that hosts the failed component. But for now there is no mechanism to
guarantee that the message is always sent in that order. So I only intended
to have the su operational message sent at the point where it is currently
sent in the successful case, and to try not to touch anything else. I think
the patch also aligns with AMFD's code, which sees the su of the failed
component as an out-of-service su.
Could you please tell me if any problem with my intention? If you agree 
then we could look into the patch if some code could be improved. Amfnd 
has a queue there that probably can buffer the su operational message 
until removing assignme
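
The ordering discussed here can be sketched roughly as follows. This is only a minimal illustration, not the amfnd code: `SuState`, `OperMsgSender` and all of their members are hypothetical names, and the message strings are placeholders. The idea is just that an "enabled" report arriving while the faulty su still has assignments is held back and released once the removal is done.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for amfnd's per-su bookkeeping (not the real struct).
struct SuState {
  bool removal_pending = false;   // faulty su still has assignments to remove
  bool enabled_deferred = false;  // an "enabled" report is being held back
};

// Hypothetical sender that records what would be reported to amfd, in order.
class OperMsgSender {
 public:
  // The faulty component has been re-instantiated successfully.
  void on_component_repaired(SuState& su) {
    if (su.removal_pending) {
      su.enabled_deferred = true;  // hold the report until removal is done
      return;
    }
    sent_.push_back("su_oper_enabled");
  }

  // amfnd has finished removing the assignment of the faulty su.
  void on_assignment_removed(SuState& su) {
    assert(su.removal_pending);  // only meaningful during a failover
    su.removal_pending = false;
    sent_.push_back("assignment_removed");
    if (su.enabled_deferred) {
      su.enabled_deferred = false;
      sent_.push_back("su_oper_enabled");  // released after the removal
    }
  }

  const std::vector<std::string>& sent() const { return sent_; }

 private:
  std::vector<std::string> sent_;
};
```

With `removal_pending` set, calling `on_component_repaired` first and `on_assignment_removed` second records `assignment_removed` before `su_oper_enabled`, which is the order described above for the successful case.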

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-21 Thread praveen malviya
Hi Minh,

Please see response with [Praveen]

Thanks
Praveen

On 22-Feb-17 5:39 AM, minh chau wrote:
> Hi Praveen,
>
> Please find my response with [Minh2]
>
> Thanks,
> Minh
>
> On 21/02/17 16:45, praveen malviya wrote:
>> Hi Minh,
>>
>> Please find my response inline with [Praveen].
>>
>> Thanks,
>> Praveen
>>
>> On 20-Feb-17 6:58 AM, minh chau wrote:
>>> Hi Praveen,
>>>
>>> Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233
>>> and #1902, it also can fix the problem.
>>> Here we have 2 solutions:
>>> - The one I sent for review is letting the failed component to be
>>> instantiated, I think it is current behavior. But one change is that
>>> amfnd will not report su operational message to amfd until amfnd
>>> finishes removing the assignment of (faulty) su which contains the
>>> failed component
>>> - The V2 patch postpones the instantiation of failed component. amfnd
>>> will instantiate the failed component (via avnd_err_su_repair) after
>>> amfnd finishes removing the assignment of faulty su.
>>>
>>> So basically the difference is the time that the failed component should
>>> be instantiated.
>>>
>>> Still in item 3.11.1.3.2:
>>> "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
>>> otherwise, the number of active service units would be higher than what
>>> is allowed by the redundancy model. However, in an Nway redundancy
>>> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
>>> configuration attribute of the service unit is set to SA_FALSE), and a
>>> repair of C2 should be attempted by reinstantiating it. If the attempt
>>> to instantiate C2 fails, the service unit becomes disabled, and SI2 must
>>> be switched-over; however, if the attempt to instantiate C2 is
>>> successful, SI2 shall remain assigned to SU1, and based on other
>>> configuration parameters and N-way redundancy model semantics, even SI1
>>> might get reassigned to SU1."
>>>
>>> My comment on V2:
>>>
>>> The configuration in #2233 is different from the example in
>>> specification, but it sounds to me the attempt to instantiate failed
>>> component should be done as soon as possible.
>>> The check in V2 patch means the failed component won't be instantiated
>>> if its SU still has any assignment. It should be true to 2N and N+M, but
>>> it's not for other SG. (As the example in specification, S2 does not
>>> have any CSI assigned to failed component C2).
>> [Praveen]As of now we have documented in the PR doc (conformance table
>> section 3.11.1.3 Recovery) that if a component faults with
>> comp-failover recovery then AMFD switch-overs the whole SU for N-Way,
>> N-Way Active and N+M models also. This is just to highlight about
>> other red models. But this documentation is not clear for an
>> unassigned comp.
>> But from the beginning, comp-failover is working this way only.
>> At-least from clean up perspective we have fixed the problem of
>> parallelism in the past in the ticket #474.
>>
>> One more thing I have noted, proxy-proxied implementation is based on
>> B.01.01. As per B.01.01, proxy will register himself and its proxied
>> as soon as it gets instantiated. In a configuration containing both
>> proxy and proxied comp, if the proxy does not get any CSI and it
>> faults with comp-failover recovery then in instantiation phase it may
>> again register its proxy. I think proxy in other SU should register
>> its proxied. I guess, from deployment perspective such a configuration
>> in which a user configures proxy without any CSI may not exists and
>> only possibility is an application modeling a legacy code in NoRed
>> model. However, in the later version of spec B.01.02, proxied was
>> supposed to mention the name of proxy CSI and thus proxy should
>> register only when its get proxy CSI.
>>
>> One more point to be noted comp-failover can also be done as a part of
>> escalation also. If a component is instantiated before the completion
>> of comp-failover recovery and if faults again then it may escalate to
>> node-failover before completion of comp-failover recovery.
>>
>> Since in spec there is no specific discussion for comp-failover
>> recovery for an unassigned comp, I will encourage other maintainers
>> also to provide inputs.
> [Minh2] Yes this was my worry too, that I could break something that has
> been working for long time in this area. If you look at component
> failover in successful case, the su operational message is always
> reported to amfd at the time amfnd completes removing assignment of su
> that hosts failed component. But there is no mechanism for now to
> guarantee that message always is sent in such order. So I only intended
> to bring the su operational message to be sent at the spot as it is
> currently working in successful case, and try not to touch everything
> else. I think the patch also aligns with AMFD's code which is seeing the
> su of failed component as an out-of-service su.
> Could you please tell me if any problem with my intention?

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-22 Thread Nagendra Kumar
>> Since in spec there is no specific discussion for comp-failover recovery for 
>> an unassigned comp, I will encourage other maintainers also to provide 
>> inputs.

I agree with not instantiating the failed component before recovery; this
also keeps the approach similar to SU failover.

@Minh: If you don't mind, we can take the su oper state changes in an
enhancement. What do you say?

Thanks
-Nagu

> -----Original Message-----
> From: praveen malviya
> Sent: 21 February 2017 11:15
> To: minh chau
> Cc: hans.nordeb...@ericsson.com; Nagendra Kumar;
> gary@dektech.com.au; long.hb.ngu...@dektech.com.au; opensaf-
> de...@lists.sourceforge.net
> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message
> synchronizes with component failover sequence [#2233]
> 
> Hi Minh,
> 
> Please find my response inline with [Praveen].
> 
> Thanks,
> Praveen
> 
> On 20-Feb-17 6:58 AM, minh chau wrote:
> > Hi Praveen,
> >
> > Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233
> > and #1902, it also can fix the problem.
> > Here we have 2 solutions:
> > - The one I sent for review is letting the failed component to be
> > instantiated, I think it is current behavior. But one change is that
> > amfnd will not report su operational message to amfd until amfnd
> > finishes removing the assignment of (faulty) su which contains the
> > failed component
> > - The V2 patch postpones the instantiation of failed component. amfnd
> > will instantiate the failed component (via avnd_err_su_repair) after
> > amfnd finishes removing the assignment of faulty su.
> >
> > So basically the difference is the time that the failed component should
> > be instantiated.
> >
> > Still in item 3.11.1.3.2:
> > "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
> > otherwise, the number of active service units would be higher than what
> > is allowed by the redundancy model. However, in an Nway redundancy
> > model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
> > configuration attribute of the service unit is set to SA_FALSE), and a
> > repair of C2 should be attempted by reinstantiating it. If the attempt
> > to instantiate C2 fails, the service unit becomes disabled, and SI2 must
> > be switched-over; however, if the attempt to instantiate C2 is
> > successful, SI2 shall remain assigned to SU1, and based on other
> > configuration parameters and N-way redundancy model semantics, even
> SI1
> > might get reassigned to SU1."
> >
> > My comment on V2:
> >
> > The configuration in #2233 is different from the example in
> > specification, but it sounds to me the attempt to instantiate failed
> > component should be done as soon as possible.
> > The check in V2 patch means the failed component won't be instantiated
> > if its SU still has any assignment. It should be true to 2N and N+M, but
> > it's not for other SG. (As the example in specification, S2 does not
> > have any CSI assigned to failed component C2).
> [Praveen]As of now we have documented in the PR doc (conformance table
> section 3.11.1.3 Recovery) that if a component faults with comp-failover
> recovery then AMFD switch-overs the whole SU for N-Way, N-Way Active
> and
> N+M models also. This is just to highlight about other red models. But
> this documentation is not clear for an unassigned comp.
> But from the beginning, comp-failover is working this way only. At-least
> from clean up perspective we have fixed the problem of parallelism in
> the past in the ticket #474.
> 
> One more thing I have noted, proxy-proxied implementation is based on
> B.01.01. As per B.01.01, proxy will register himself and its proxied as
> soon as it gets instantiated. In a configuration containing both proxy
> and proxied comp, if the proxy does not get any CSI and it faults with
> comp-failover recovery then in instantiation phase it may again register
> its proxy. I think proxy in other SU should register its proxied. I
> guess, from deployment perspective such a configuration in which a user
> configures proxy without any CSI may not exists and only possibility is
> an application modeling a legacy code in NoRed model. However, in the
> later version of spec B.01.02, proxied was supposed to mention the name
> of proxy CSI and thus proxy should register only when its get proxy CSI.
> 
> One more point to be noted comp-failover can also be done as a part of
> escalation also. If a component is instantiated before the completion of
> comp-failover recovery and if faults again then it may escalate to
> node-failover before completion

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-22 Thread minh chau
Hi Nagu, Praveen,

Please find my comment in [Minh3]

Thanks,
Minh

On 22/02/17 19:34, Nagendra Kumar wrote:
>>> Since in spec there is no specific discussion for comp-failover recovery 
>>> for an unassigned comp, I will encourage other maintainers also to provide 
>>> inputs.
> I do agree for not instantiating failed component before recovery, this keeps 
> the approach similar to SU failover also.
[Minh3]: There's one example of component failover that I would like us
to have a look at:
- 2N application, SU4/SU5 have the active/standby assignment respectively,
each SU has 3 components
- Add a sleep of 10 seconds in the clc script start command of the first
component, C41, of SU4
Steps:
1- Kill C41 to trigger component failover
2- SU4 goes for quiesced assignment
3- SU5 goes for active assignment
4- SU4's assignment is removed
5- Now there's a pause of 10 seconds due to the clc script start, to ensure
that C41 is healthy
6- Next, SU4 gets the standby assignment.

From the above example, I think we can see some problems if the
re-instantiation of C41 is delayed:
- Because C41 is faulty, it needs to be restarted OK because its SU has
an assignment
- Moving the re-instantiation of C41 further down means the recovery
will take longer
- What if the re-instantiation of C41 leads to instantiation-failed?

Whether C41 has an assignment or is unassigned, the
OperState/PresenceState resulting from the re-instantiation of the faulty
C41 affects SU4's eligibility for assignment.
There's a parallelism between [restart of faulty component C41] and
[movement from Active->Quiesced->Removed assignment of SU4]; it's good
to have, and it's the current behavior of amfnd.
That's my understanding so far, but I could be wrong. Let's check with Hans
and Gary.
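
A rough way to see the timing argument above: if the restart of C41 overlaps with the Active->Quiesced->Removed sequence, SU4 is ready when the slower of the two finishes; if the restart only begins after the removal, the two durations add up. The sketch below assumes this simplified model. The function names are hypothetical, and while the 10-second start delay comes from the example, the 3 seconds used for steps 2 to 4 is purely an illustrative number.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using std::chrono::seconds;

// Restart of the faulty component runs alongside the removal of the
// assignment: the su is ready once the slower of the two has finished.
seconds ready_when_parallel(seconds switchover, seconds instantiate) {
  assert(switchover.count() >= 0 && instantiate.count() >= 0);
  return std::max(switchover, instantiate);
}

// Restart only begins after the assignment has been removed:
// the two durations add up.
seconds ready_when_deferred(seconds switchover, seconds instantiate) {
  assert(switchover.count() >= 0 && instantiate.count() >= 0);
  return switchover + instantiate;
}
```

With the 10-second start delay and an assumed 3-second switch-over, `ready_when_parallel` gives 10 seconds and `ready_when_deferred` gives 13 seconds.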

>
> @Minh: If you don't mind, we can take su oper state changes in an 
> enhancement. What do you say ?
[Minh3]: Enhancement is ok, but I hope we can have this fix in 5.2 release.
>
> Thanks
> -Nagu
>
>> -----Original Message-----
>> From: praveen malviya
>> Sent: 21 February 2017 11:15
>> To: minh chau
>> Cc: hans.nordeb...@ericsson.com; Nagendra Kumar;
>> gary....@dektech.com.au; long.hb.ngu...@dektech.com.au; opensaf-
>> de...@lists.sourceforge.net
>> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message
>> synchronizes with component failover sequence [#2233]
>>
>> Hi Minh,
>>
>> Please find my response inline with [Praveen].
>>
>> Thanks,
>> Praveen
>>
>> On 20-Feb-17 6:58 AM, minh chau wrote:
>>> Hi Praveen,
>>>
>>> Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233
>>> and #1902, it also can fix the problem.
>>> Here we have 2 solutions:
>>> - The one I sent for review is letting the failed component to be
>>> instantiated, I think it is current behavior. But one change is that
>>> amfnd will not report su operational message to amfd until amfnd
>>> finishes removing the assignment of (faulty) su which contains the
>>> failed component
>>> - The V2 patch postpones the instantiation of failed component. amfnd
>>> will instantiate the failed component (via avnd_err_su_repair) after
>>> amfnd finishes removing the assignment of faulty su.
>>>
>>> So basically the difference is the time that the failed component should
>>> be instantiated.
>>>
>>> Still in item 3.11.1.3.2:
>>> "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
>>> otherwise, the number of active service units would be higher than what
>>> is allowed by the redundancy model. However, in an Nway redundancy
>>> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
>>> configuration attribute of the service unit is set to SA_FALSE), and a
>>> repair of C2 should be attempted by reinstantiating it. If the attempt
>>> to instantiate C2 fails, the service unit becomes disabled, and SI2 must
>>> be switched-over; however, if the attempt to instantiate C2 is
>>> successful, SI2 shall remain assigned to SU1, and based on other
>>> configuration parameters and N-way redundancy model semantics, even
>> SI1
>>> might get reassigned to SU1."
>>>
>>> My comment on V2:
>>>
>>> The configuration in #2233 is different from the example in
>>> specification, but it sounds to me the attempt to instantiate failed
>>> component should be done as soon as possible.
>>> The check in V2 patch means the failed component won't be instantiated
>>> if its SU still has any assignment. It should be true to 2N and N+M, but
>>>

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-02-28 Thread Gary Lee
Hi

It seems the component should be re-instantiated if it has no CSI. Whether or 
not there is an SI assigned should be irrelevant?
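
The two conditions being compared in this thread can be put side by side as predicates. This is only a sketch with hypothetical names (`CompView`, `SuView` and both functions); the V2 patch itself is not reproduced here, so the first predicate just restates how its check was described earlier in the thread.

```cpp
#include <cstddef>

// Hypothetical views of a failed component and of the su that hosts it.
struct CompView { std::size_t csi_count; };  // CSIs assigned to the component
struct SuView { std::size_t si_count; };     // SIs still assigned to the su

// V2 check as described in the thread: do not instantiate the failed
// component while its su still has any assignment.
constexpr bool may_instantiate_su_based(const SuView& su) {
  return su.si_count == 0;
}

// Alternative raised here: look only at the component's own CSIs,
// regardless of whether the su still has an SI assigned.
constexpr bool may_instantiate_csi_based(const CompView& comp) {
  return comp.csi_count == 0;
}
```

For the configuration in #2233 (failed component without any CSI, su with SIs still assigned), the first predicate is false and the second is true.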

Thanks
Gary

-----Original Message-----
From: minh chau
Date: Thursday, 23 February 2017 at 3:16 pm
To: Nagendra Kumar, Praveen Malviya
Cc: gary
Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message 
synchronizes with component failover sequence [#2233]

Hi Nagu, Praveen,

Please find my comment in [Minh3]

Thanks,
Minh

On 22/02/17 19:34, Nagendra Kumar wrote:
>>> Since in spec there is no specific discussion for comp-failover 
recovery for an unassigned comp, I will encourage other maintainers also to 
provide inputs.
> I do agree for not instantiating failed component before recovery, this 
keeps the approach similar to SU failover also.
[Minh3]: There's one example of component failover that I would like us 
to have a look
- 2N application, SU4/SU5 has active/standby assignment respectively, 
each SU has 3 components
- Add a sleep of 10 seconds in clc script start command of first 
component C41 of SU4
Steps:
1- Kill C41 to trigger component failover
2- SU4 goes for quiesced assignment
3- SU5 goes for active assignment
4- SU4 is removed its assignment
5- Now there's a pause of 10 seconds due to clc script start, to ensure 
that C41 is healthy
6- Next SU4 has standby assignment.

 From the above example, I think we can see some problems if the 
re-instantiation of C41 is delayed:
- Because C41 is faulty, it needs to be restarted ok because its SU has 
assignment
- Moving re-instantiation of C41 is further down that means the recovery 
will take longer
- What if re-instantiation of C41 leads to instantation-failed

Whether or not the C41 has assignment or is unassigned, the 
OperState/PresenceState result from re-instantiation of faulty C41 
affects to SU4's eligibility for assignment.
There's a parallelism between [restart of faulty component C41] and 
[movement from Active->Quiesced->Removed assignment of SU4], it's good 
to have and it's current behavior of amfnd.
It's my understanding so far but I could be wrong. Let's check with Hans 
and Gary.

>
> @Minh: If you don't mind, we can take su oper state changes in an 
enhancement. What do you say ?
[Minh3]: Enhancement is ok, but I hope we can have this fix in 5.2 release.
>
> Thanks
> -Nagu
>
>> -----Original Message-----
>> From: praveen malviya
>> Sent: 21 February 2017 11:15
>> To: minh chau
>> Cc: hans.nordeb...@ericsson.com; Nagendra Kumar;
>> gary@dektech.com.au; long.hb.ngu...@dektech.com.au; opensaf-
>> de...@lists.sourceforge.net
>> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message
>> synchronizes with component failover sequence [#2233]
>>
>> Hi Minh,
>>
>> Please find my response inline with [Praveen].
>>
>> Thanks,
>> Praveen
>>
>> On 20-Feb-17 6:58 AM, minh chau wrote:
>>> Hi Praveen,
>>>
>>> Thanks for your V2 patch, I have tested V2 in scenario of ticket #2233
>>> and #1902, it also can fix the problem.
>>> Here we have 2 solutions:
>>> - The one I sent for review is letting the failed component to be
>>> instantiated, I think it is current behavior. But one change is that
>>> amfnd will not report su operational message to amfd until amfnd
>>> finishes removing the assignment of (faulty) su which contains the
>>> failed component
>>> - The V2 patch postpones the instantiation of failed component. amfnd
>>> will instantiate the failed component (via avnd_err_su_repair) after
>>> amfnd finishes removing the assignment of faulty su.
>>>
>>> So basically the difference is the time that the failed component should
>>> be instantiated.
>>>
>>> Still in item 3.11.1.3.2:
>>> "In a 2N or N+M redundancy model, SI2 also needs to be switched over;
>>> otherwise, the number of active service units would be higher than what
>>> is allowed by the redundancy model. However, in an Nway redundancy
>>> model, SI2 could be left assigned to SU1 (if the saAmfSUFailover
>>> configuration attribute of the service unit is set to SA_FALSE), and a
>>> repair of C2 should be attempted by reinstantiating it. If the attempt
>>> to instantiate C2 fails, the service unit becomes disabled, and SI2 must
  

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-03-02 Thread minh chau
Hi,

Thanks Gary.
@Nagu, Praveen: Have you had time to check the example in my previous email?
Ticket #2179 is about documenting that full escalation is supported
for the SC absence feature; it is waiting for the fix of #2233.
I think there's no big change in code for #2233; it's a matter of
deciding on the re-instantiation of the failed component.

Thanks,
Minh

On 01/03/17 15:42, Gary Lee wrote:
> Hi
>
> It seems the component should be re-instantiated if it has no CSI. Whether or 
> not there is an SI assigned should be irrelevant?
>
> Thanks
> Gary
>
> -----Original Message-----
> From: minh chau
> Date: Thursday, 23 February 2017 at 3:16 pm
> To: Nagendra Kumar, Praveen Malviya
> Cc: gary
> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message 
> synchronizes with component failover sequence [#2233]
>
>  Hi Nagu, Praveen,
>  
>  Please find my comment in [Minh3]
>  
>  Thanks,
>  Minh
>  
>  On 22/02/17 19:34, Nagendra Kumar wrote:
>  >>> Since in spec there is no specific discussion for comp-failover 
> recovery for an unassigned comp, I will encourage other maintainers also to 
> provide inputs.
>  > I do agree for not instantiating failed component before recovery, 
> this keeps the approach similar to SU failover also.
>  [Minh3]: There's one example of component failover that I would like us
>  to have a look at:
>  - 2N application, SU4/SU5 have active/standby assignments respectively,
>  each SU has 3 components
>  - Add a sleep of 10 seconds in the clc script start command of the first
>  component C41 of SU4
>  Steps:
>  1- Kill C41 to trigger component failover
>  2- SU4 goes for quiesced assignment
>  3- SU5 goes for active assignment
>  4- SU4 has its assignment removed
>  5- Now there's a pause of 10 seconds due to the clc script start, to ensure
>  that C41 is healthy
>  6- Next SU4 has standby assignment.
>  
>  From the above example, I think we can see some problems if the
>  re-instantiation of C41 is delayed:
>  - Because C41 is faulty, it needs to be restarted OK because its SU has
>  an assignment
>  - Moving re-instantiation of C41 further down means the recovery
>  will take longer
>  - What if re-instantiation of C41 leads to instantiation-failed?
>  
>  Whether C41 is assigned or unassigned, the
>  OperState/PresenceState resulting from re-instantiation of the faulty C41
>  affects SU4's eligibility for assignment.
>  There's a parallelism between [restart of faulty component C41] and
>  [movement from Active->Quiesced->Removed assignment of SU4]; it's good
>  to have and it's the current behavior of amfnd.
>  That's my understanding so far, but I could be wrong. Let's check with Hans
>  and Gary.
>  
>  >
>  > @Minh: If you don't mind, we can take su oper state changes in an 
> enhancement. What do you say ?
>  [Minh3]: Enhancement is ok, but I hope we can have this fix in 5.2 
> release.
>  >
>  > Thanks
>  > -Nagu
>  >
>  >> -----Original Message-----
>  >> From: praveen malviya
>  >> Sent: 21 February 2017 11:15
>  >> To: minh chau
>  >> Cc: hans.nordeb...@ericsson.com; Nagendra Kumar;
>  >> gary@dektech.com.au; long.hb.ngu...@dektech.com.au; opensaf-
>  >> de...@lists.sourceforge.net
>  >> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational 
> message
>  >> synchronizes with component failover sequence [#2233]
>  >>
>  >> Hi Minh,
>  >>
>  >> Please find my response inline with [Praveen].
>  >>
>  >> Thanks,
>  >> Praveen
>  >>
>  >> On 20-Feb-17 6:58 AM, minh chau wrote:
>  >>> Hi Praveen,
>  >>>
>  >>> Thanks for your V2 patch, I have tested V2 in scenario of ticket 
> #2233
>  >>> and #1902, it also can fix the problem.
>  >>> Here we have 2 solutions:
>  >>> - The one I sent for review is letting the failed component to be
>  >>> instantiated, I think it is current behavior. But one change is that
>  >>> amfnd will not report su operational message to amfd until amfnd
>  >>> finishes removing the assignment of (faulty) su which contains the
>  >>> failed component
>  >>> - The V2 patch postpones the instantiation of fa

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-03-02 Thread praveen malviya
Hi Minh,
Please see response with [Praveen].

Thanks,
Praveen



On 02-Mar-17 1:43 PM, minh chau wrote:
> Hi,
>
> Thanks Gary.
> @Nagu, Praveen: Have you had time to check the example in my previous
> email?
> The ticket #2179 is about to document that full escalation is supported
> for SC absence feature, it is waiting for fix of #2233.
> I think there's not big change in code for #2233, it's a matter of
> decision to make for re-instantiation of failed component.
>
> Thanks,
> Minh
>
> On 01/03/17 15:42, Gary Lee wrote:
>> Hi
>>
>> It seems the component should be re-instantiated if it has no CSI.
>> Whether or not there is an SI assigned should be irrelevant?
>>
>> Thanks
>> Gary
>>
>> -Original Message-
>> From: minh chau 
>> Date: Thursday, 23 February 2017 at 3:16 pm
>> To: Nagendra Kumar, Praveen Malviya
>> Cc: gary
>> Subject: Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational
>> message synchronizes with component failover sequence [#2233]
>>
>>  Hi Nagu, Praveen,
>>   Please find my comment in [Minh3]
>>   Thanks,
>>  Minh
>>   On 22/02/17 19:34, Nagendra Kumar wrote:
>>  >>> Since in spec there is no specific discussion for
>> comp-failover recovery for an unassigned comp, I will encourage other
>> maintainers also to provide inputs.
>>  > I do agree for not instantiating failed component before
>> recovery, this keeps the approach similar to SU failover also.
>>  [Minh3]: There's one example of component failover that I would
>>  like us to have a look at:
>>  - 2N application, SU4/SU5 have active/standby assignments
>>  respectively, each SU has 3 components
>>  - Add a sleep of 10 seconds in the clc script start command of the first
>>  component C41 of SU4
>>  Steps:
>>  1- Kill C41 to trigger component failover
>>  2- SU4 goes for quiesced assignment
>>  3- SU5 goes for active assignment
>>  4- SU4 has its assignment removed
>>  5- Now there's a pause of 10 seconds due to the clc script start, to
>>  ensure that C41 is healthy
>>  6- Next SU4 has standby assignment.
>>
>>  From the above example, I think we can see some problems if the
>>  re-instantiation of C41 is delayed:
>>  - Because C41 is faulty, it needs to be restarted OK because its
>>  SU has an assignment
>>  - Moving re-instantiation of C41 further down means the
>>  recovery will take longer
>>  - What if re-instantiation of C41 leads to instantiation-failed?
[Praveen] If AMFND re-instantiates C41 after removal of the assignment and
it moves to instantiation-failed, then:
- The node will be rebooted if nodefailfastonterminationfailure=true.
- If nodefailfastonterminationfailure=false then, as per section 4.6 page
212, the SU will be marked INST_FAILED and AMF will have to terminate all
the components. Termination of the other components will be easier if they
do not have assignments or pending assignments.

If C41 is instantiated before removal of assignments and it moves to
INST_FAILED state, then AMFND will be terminating the other comps of the SU
while they are in the middle of quiesced or removal of assignment. So a
component may have quiesced/removal/terminate callbacks in different
orders in its mailbox. This will make things complex.
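
The two outcomes above, and the case described as complex, can be sketched roughly as follows. This is an illustration only, not amfnd code: `InstFailedAction`, `Comp`, `on_instantiation_failed` and `terminated_mid_assignment` are hypothetical names, and the `node_failfast` flag merely stands in for the node fail-fast attribute mentioned above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical outcome of a failed re-instantiation (illustration only).
enum class InstFailedAction { kRebootNode, kTerminateAllComps };

// Hypothetical, minimal view of a component inside the faulty SU.
struct Comp {
  std::string name;
  bool assignment_in_progress;  // still assigned, or mid quiesced/removal
};

// node_failfast stands in for the node fail-fast attribute named in the mail.
InstFailedAction on_instantiation_failed(bool node_failfast) {
  return node_failfast ? InstFailedAction::kRebootNode
                       : InstFailedAction::kTerminateAllComps;
}

// Components that would receive a terminate request while an assignment
// change is still in progress: the case the mail describes as complex.
std::vector<std::string> terminated_mid_assignment(
    const std::vector<Comp>& comps) {
  std::vector<std::string> busy;
  for (const Comp& c : comps) {
    assert(!c.name.empty());  // sketch-level sanity check
    if (c.assignment_in_progress) busy.push_back(c.name);
  }
  return busy;
}
```

If re-instantiation is attempted only after the assignments are removed, `terminated_mid_assignment` returns an empty list in this model, which is the simpler termination case argued for above.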

>>  Whether C41 is assigned or unassigned, the
>>  OperState/PresenceState resulting from re-instantiation of the faulty C41
>>  affects SU4's eligibility for assignment.
[Praveen] Here SU4 will get only fresh assignments after C41 gets
enabled. For fresh assignments, AMF can choose any of the available spare
SUs, and SU4 will be chosen based on rank.
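
As a rough sketch of rank-based selection for fresh assignments (not amfd code): the `Su` struct and `pick_su_for_fresh_assignment` are hypothetical, and the sketch assumes that a smaller numeric rank means a more preferred SU and that an SU with a still-disabled component is not in service.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, minimal view of a service unit (illustration only).
struct Su {
  std::string name;
  unsigned rank;    // assumption: smaller value means more preferred
  bool in_service;  // false while the SU is still disabled after the fault
};

// Returns the name of the in-service SU with the best rank,
// or an empty string if no SU is eligible.
std::string pick_su_for_fresh_assignment(const std::vector<Su>& sus) {
  const Su* best = nullptr;
  for (const Su& su : sus) {
    if (!su.in_service) continue;  // faulty SU is skipped until re-enabled
    if (best == nullptr || su.rank < best->rank) best = &su;
  }
  return best == nullptr ? std::string() : best->name;
}
```

In this model SU4 is considered again only once it is back in service, and whether it is then picked depends purely on its rank relative to the other candidates.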

At the same time, the AMF spec discourages choosing recently faulty SUs
for assignments too soon. This is highlighted in the context of the SG
auto-adjust feature in section 3.6.1.2, Initiation of the Auto-Adjust
Procedure for a Service Group:
"
However, if the completion of a recovery/repair operation
has made the service group eligible for auto-adjustment (for example, if 
a node joins the cluster after the repair), it is not so wise to run the 
auto-adjust procedure for the service group involving the newly repaired 
service units immediately. Thus, the service
group-level configuration attribute auto-adjust probation period has 
been introduced (actually, the saAmfSGAutoAdjustProb configuration 
attribute in the SaAmfSG object class, shown in Section 8.9). When a 
service unit becomes available for auto-adjustment after a 
repair/recovery operation, the service unit enter

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-03-03 Thread minh chau
Hi Praveen,

I have two comments with [Minh4].

Thanks
Minh

On 02/03/17 20:49, praveen malviya wrote:
> Hi Minh,
> Please see response with [Praveen].
>
> Thanks,
> Praveen
>
>
>
> [Praveen] If AMFND re-instantiates C41 after removal of the assignment and
> it moves to instantiation-failed, then:
> - The node will be rebooted if nodefailfastonterminationfailure=true.
> - If nodefailfastonterminationfailure=false then, as per section 4.6 page
> 212, the SU will be marked INST_FAILED and AMF will have to terminate all
> the components. Termination of the other components will be easier if they
> do not have assignments or pending assignments.
>
> If C41 is instantiated before removal of assignments and it moves to
> INST_FAILED state, then AMFND will be terminating the other comps of the SU
> while they are in the middle of quiesced or removal of assignment. So a
> component may have quiesced/removal/terminate callbacks in different
> orders in its mailbox. This will make things complex.
[Minh4]: I am not sure I understand the complexity you mentioned, as it
has been working like this for a long time. If we are going to change the
current behavior so that amfnd instantiates the failed component after
removal of the assignment, then I think it should be addressed in a
separate enhancement ticket. The complexity in the current behavior could
be reduced or removed if we change to another behavior. It looks like a
big change, not just in the code but also in terms of backward
compatibility. For now, let's fix the message ordering problem of the
existing code/design (you already agreed?). I can create another
enhancement/discussion ticket for the matter of instantiation of the
failed component; from there more evidence from the specs can be added, ...
What do you thi

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-03-05 Thread praveen malviya
Hi Minh,

Please see inline with [Praveen].

Thanks,
Praveen

On 03-Mar-17 5:39 PM, minh chau wrote:
> Hi Praveen,
>
> I have two comments with [Minh4].
>
> Thanks
> Minh
>

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-03-06 Thread minh chau
Hi Praveen,

Please see comments with [Minh5]

Thanks,
Minh

On 06/03/17 17:52, praveen malviya wrote:
> Hi Minh,
>
> Please see inline with [Praveen].
>
> Thanks,
> Praveen
>

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-03-06 Thread praveen malviya
Hi Minh,

Is there any harm if both patches are merged? One patch adds strict
checks for message ordering in case of comp-failover recovery of an
assigned or non-assigned component. The other patch ensures that if an
assigned or non-assigned comp faults with comp-failover recovery, then
AMF will first switch over the whole SU (current implementation,
irrespective of red models), and after completion of the switchover,
re-instantiation of the failed comp will be attempted.
Also, I think, from the headless perspective, the strict check of patch V1
is important when comp-failover occurs in the absence of SCs.
So I have a minor query here: is there any impact of late instantiation
of the comp when comp-failover occurs in SC absence?
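
To make the difference between the two patches concrete, here is a toy model of the node-side event ordering. It is a sketch under assumptions, not the actual amfnd logic: `Policy`, `CompFailoverModel` and the event strings are hypothetical, and only the ordering rules described in this thread are modelled (V1 restarts the comp at once but holds back the su oper message (enabled); V2 postpones the restart until the assignment is removed).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical policies standing in for the two patches (illustration only).
enum class Policy {
  kDeferOperMessage,   // V1: restart comp at once, hold back su_oper(enabled)
  kDeferInstantiation  // V2: restart comp only after the assignment removal
};

// Toy model that records the order of node-side events.
class CompFailoverModel {
 public:
  explicit CompFailoverModel(Policy policy) : policy_(policy) {}

  // The faulty component has been detected; comp-failover starts.
  void on_component_failed() {
    if (policy_ == Policy::kDeferOperMessage) start_instantiation();
  }

  // The re-instantiation of the failed component has completed.
  void on_instantiated() {
    assert(instantiating_);  // cannot complete before it was started
    instantiating_ = false;
    comp_ready_ = true;
    maybe_send_enabled();
  }

  // The SU has finished removing its assignment.
  void on_assignment_removed() {
    removal_done_ = true;
    events_.push_back("assignment removed");
    if (policy_ == Policy::kDeferInstantiation) start_instantiation();
    maybe_send_enabled();
  }

  const std::vector<std::string>& events() const { return events_; }

 private:
  void start_instantiation() {
    instantiating_ = true;
    events_.push_back("instantiate comp");
  }

  // su_oper(enabled) goes out only once both conditions hold.
  void maybe_send_enabled() {
    if (comp_ready_ && removal_done_ && !enabled_sent_) {
      enabled_sent_ = true;
      events_.push_back("su_oper(enabled)");
    }
  }

  Policy policy_;
  bool instantiating_ = false;
  bool comp_ready_ = false;
  bool removal_done_ = false;
  bool enabled_sent_ = false;
  std::vector<std::string> events_;
};
```

In both policies of this model the enabled message is always the last event, after "assignment removed"; the policies differ only in when "instantiate comp" appears, which is the point under discussion.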


Also, I think an enhancement ticket should now be raised for
implementation of comp-failover recovery as per spec for the N-Way and
N-Way active models.


Thanks,
Praveen



On 07-Mar-17 4:10 AM, minh chau wrote:
> Hi Praveen,
>
> Please see comments with [Minh5]
>
> Thanks,
> Minh
>

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-03-07 Thread minh chau
Hi Praveen,

I don't think we need both patches; either one is enough to fix the
problem of comp f/o in the case of an unassigned component. If we have
both patches, the V2 patch will make reinstantiation of the failed,
unassigned comp happen after the assignment's removal, so V1 is not needed
anymore because the su operational message (enabled) will always be sent
after switchover.
I am not 100% sure what the impact is of moving reinstantiation of the
component after the SI assignment's removal, but basically this change of
behavior is exposed to applications.
One potential impact I can think of, in either a headless or a normal
cluster, is that the failed component will have less time for its
instantiation before receiving a csi assignment (until now, reinstantiation
of the failed component has been started regardless of SI switchover), so
it could be a timing issue for an application due to application-specific
dependencies in the instantiation phase.

Thanks,
Minh

On 07/03/17 16:34, praveen malviya wrote:
> Hi Minh,
>
> Is there any harm if both patches are merged? One patch adds
> strict checks for message ordering in case of comp-failover recovery
> of an assigned or non-assigned component. The other patch ensures that if
> an assigned or non-assigned comp faults with comp-failover recovery,
> then AMF will first switch over the whole SU (current implementation,
> irrespective of red models), and after completion of the switchover,
> re-instantiation of the failed comp will be attempted.
> Also, I think, from the headless perspective, the strict check of patch V1
> is important when comp-failover occurs in the absence of SCs.
> So I have a minor query here: is there any impact of late
> instantiation of the comp when comp-failover occurs in SC absence?
>
>
> Also, I think an enhancement ticket should now be raised for
> implementation of comp-failover recovery as per spec for the N-Way and
> N-Way active models.
>
>
> Thanks,
> Praveen

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-03-07 Thread praveen malviya

On 08-Mar-17 9:11 AM, minh chau wrote:
> Hi Praveen,
>
> I don't think we need both patches; either one is enough to fix the
> problem of comp f/o in the case of an unassigned component. If we have
> both patches, the V2 patch will make reinstantiation of the failed,
> unassigned comp happen after the assignment's removal, so V1 is not needed
> anymore because the su operational message (enabled) will always be sent
> after switchover.
> I am not 100% sure what the impact is of moving reinstantiation of the
> component after the SI assignment's removal, but basically this change of
> behavior is exposed to applications.
[Praveen] I have now checked the comment in ticket #2233, which
contains the problem description for the SC absence case. I think the V2
patch will not allow two su_oper messages, as recovery can be done only
after the first controller comes up. So I prefer V2 as the solution. With
V2, comp instantiation is done after completion of recovery for both
assigned and unassigned components.

However, when comp-failover recovery is implemented in a spec-compliant
way for the N-Way and N-Way active models, then surely we need to
instantiate the component as early as possible.

> One potential impact I can think of, in either a headless or a normal
> cluster, is that the failed component will have less time for its
> instantiation before receiving a csi assignment (until now, reinstantiation
> of the failed component has been started regardless of SI switchover), so
> it could be a timing issue for an application due to application-specific
> dependencies in the instantiation phase.
[Praveen] I did not fully get this. But if an instantiation level is
configured for the components in the SU, then instantiation of a failed
component of any level will not lead to instantiation of components of
other levels.

From the spec: "The instantiation level is, above all, a means to limit the
load on the system during the instantiation process."
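
A minimal sketch of the point about instantiation levels, with hypothetical names (`Comp`, `su_instantiation_order`, `restart_failed_comp`) and under the assumption that components with a lower level are instantiated first when the whole SU is instantiated. It is not how amfnd is implemented; it only contrasts a full SU instantiation with the restart of one failed component.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, minimal view of a component (illustration only).
struct Comp {
  std::string name;
  unsigned level;  // instantiation level; assumption: lower goes first
};

// Full SU instantiation: components ordered by ascending level.
std::vector<std::string> su_instantiation_order(std::vector<Comp> comps) {
  std::stable_sort(comps.begin(), comps.end(),
                   [](const Comp& a, const Comp& b) {
                     return a.level < b.level;
                   });
  std::vector<std::string> order;
  for (const Comp& c : comps) order.push_back(c.name);
  return order;
}

// Restart of one failed component: only that component is instantiated,
// whatever its level; components of other levels are left untouched.
std::vector<std::string> restart_failed_comp(const std::vector<Comp>& comps,
                                             const std::string& failed) {
  std::vector<std::string> order;
  for (const Comp& c : comps) {
    if (c.name == failed) order.push_back(c.name);
  }
  return order;
}
```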

>
> Thanks,
> Minh
>

Re: [devel] [PATCH 1 of 1] AMFND: Ensure su operational message synchronizes with component failover sequence [#2233]

2017-03-08 Thread minh chau
Hi Praveen,

Since you prefer V2, please push it and update the PR that this change
relates to.

Thanks,
Minh

On 08/03/17 18:07, praveen malviya wrote:
>
> On 08-Mar-17 9:11 AM, minh chau wrote:
>> Hi Praveen,
>>
>> I don't think we need both patches, one of those is enough to fix the
>> problem of comp f/o in case unassigned component. When we have both
>> patches, V2 patch will make reinstantiation of failed-unassigned comp
>> after assignment's removal, so V1 is not needed anymore because su
>> operational message (enabled) will always be sent after switchover.
>> I am not 100% sure how is the impact of moving reinstantiation of
>> component after SI assignment's removal, but basically this change of
>> behavior is exposed to applications
> [Praveen] I have checked the comment in the ticket #2233 now which 
> contains the problem description in SC absence case. I think V2 patch 
> will not allow two su_oper message as recovery can be done only after 
> first controller comes up. So I prefer V2 as a solution. With v2 comp 
> instantiation is being done after completion of recovery for both 
> assigned and unassigned components.
>
> However, when comp-failover recovery is implemented in spec compliant 
> way for N-Way and N-Way active model, then surely we need to 
> instantiate component as early as possible.
>
>> One potential impact I can think of, in either headless or normal
>> cluster, is that failed component will have less time for its
>> instantiation before receiving csi assignment (since reinstantiation of
>> failed component has been started regardless SI switchover), so it could
>> be a timing issue for application due to application's specific
>> dependencies in instantiation phase.
> [Praveen] I did not fully get this. But if an instantiation level is
> configured for the components in the su, then instantiation of a failed
> component of any level will not lead to instantiation of components of
> other levels.
>
> From the spec: "The instantiation level is, above all, a means to limit
> the load on the system during the instantiation process."
>
>>
>> Thanks,
>> Minh
>>
>> On 07/03/17 16:34, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> Is there any harm if both patches are merged? One patch adds
>>> strict checks for message ordering in case of comp-failover recovery
>>> of an assigned or non-assigned component. The other patch ensures that
>>> if an assigned or non-assigned comp faults with comp-failover recovery,
>>> AMF will first switch over the whole SU (the current implementation,
>>> irrespective of redundancy models) and, after completion of the
>>> switchover, re-instantiation of the failed comp will be attempted.
>>> Also, I think, from a headless perspective, the strict check of patch V1
>>> is important when comp-failover occurs in the absence of SCs.
>>> So I have a minor query here: is there any impact of late
>>> instantiation of the comp when comp-failover occurs in SC absence?
>>>
>>>
>>> Also, I think an enhancement ticket should now be raised for
>>> implementing comp-failover recovery as per the spec for the N-Way and
>>> N-Way active models.
>>>
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>>
>>> On 07-Mar-17 4:10 AM, minh chau wrote:
>>>> Hi Praveen,
>>>>
>>>> Please see comments with [Minh5]
>>>>
>>>> Thanks,
>>>> Minh
>>>>
>>>> On 06/03/17 17:52, praveen malviya wrote:
>>>>> Hi Minh,
>>>>>
>>>>> Please see inline with [Praveen].
>>>>>
>>>>> Thanks,
>>>>> Praveen
>>>>>
>>>>> On 03-Mar-17 5:39 PM, minh chau wrote:
>>>>>> Hi Praveen,
>>>>>>
>>>>>> I have two comments with [Minh4].
>>>>>>
>>>>>> Thanks
>>>>>> Minh
>>>>>>
>>>>>> On 02/03/17 20:49, praveen malviya wrote:
>>>>>>> Hi Minh,
>>>>>>> Please see response with [Praveen].
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Praveen
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 02-Mar-17 1:43 PM, minh chau wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Thanks Gary.
>>>>>>>> @Nagu, Praveen: Have you had time to check t