[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery

2017-03-09 Thread Minh Hon Chau
- **status**: review --> fixed
- **assigned_to**: Minh Hon Chau -->  nobody 



---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** fixed
**Milestone:** 5.0.2
**Labels:** unstable sg 
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Mon Mar 06, 2017 06:51 AM UTC
**Owner:** nobody


This issue occurs during component failover recovery while a node is being locked.

**Configuration and steps:**
1- Set up a 2N model: PL4 hosts SU4, PL5 hosts SU5, PL3 hosts SU5B.
2- Bring up the 2N app; SU4 has the active assignment, SU5 has the standby assignment.
3- Lock PL4.
4- Delay the CSI remove callback in the component of SU4 by a few seconds.
5- Delay the quiesced CSI set callback in the component of SU5 by a few seconds.
6- When SU5 finishes its active assignment, SU4 receives the assignment removal 
from amfd. In the meantime, a component failover report is triggered by the 
component of SU5.
7- SU5 now receives the quiesced CSI set callback from amfd.
8- Release both callbacks delayed in steps 4 and 5.

**Observation:**
The SG remains unstable: the failed SU (SU5) cannot be repaired, and no entity 
can be locked or unlocked.

When amfd processes the quiesced assignment response in the REALIGN state, it 
takes no action:
> Dec 20 13:23:22.272043 osafamfd [487:sg_2n_fsm.cc:1448] >> 
> susi_success_sg_realign: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' 
> act=5, state=3
> Dec 20 13:23:22.272048 osafamfd [487:sg.cc:1756] TR 
> safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon found in 
> safSg=AmfDemoTwon,safApp=AmfDemoTwon
> Dec 20 13:23:22.272054 osafamfd [487:sg_2n_fsm.cc:0477] >> 
> avd_sg_2n_act_susi: 'safSg=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272059 osafamfd [487:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwon,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272065 osafamfd [487:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272071 osafamfd [487:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272076 osafamfd [487:sg_2n_fsm.cc:0501] TR 
> su_1'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', su_2'(null)'
> Dec 20 13:23:22.272082 osafamfd [487:sg_2n_fsm.cc:0555] << 
> avd_sg_2n_act_susi: act: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> stdby: '(null)'
> Dec 20 13:23:22.272087 osafamfd [487:sg_2n_fsm.cc:1862] << 
> susi_success_sg_realign: rc:1

In this SG FSM function, SU5 is expected to be OUT_OF_SERVICE, but it is 
currently IN_SERVICE.
SU5 is first reported as OUT_OF_SERVICE by the su_oper_state[DISABLED] message 
sent as part of the component failover report:
Dec 20 13:22:56.241508 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: 
id:56, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:2

The failed component is then instantiated again, which generates another 
su_oper_state[ENABLED] message and sets SU5 back to IN_SERVICE:
Dec 20 13:22:58.481319 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: 
id:62, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:1

SU5 should remain OUT_OF_SERVICE while amfd orchestrates component failover 
recovery, which starts with the QUIESCED assignment of SU5. If the failed 
component is re-instantiated quickly enough, as in this test, the SG FSM ends 
up in an unexpected sequence.
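
The ordering problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToySu`, `on_quiesced_response` and `run` are hypothetical names, and the rule that readiness simply follows the reported operational state is a simplification, not amfd code. Only the numeric states (1 = ENABLED, 2 = DISABLED) come from the traces.

```python
from enum import Enum


class Readiness(Enum):
    IN_SERVICE = "IN_SERVICE"
    OUT_OF_SERVICE = "OUT_OF_SERVICE"


# Operational state values as they appear in the avd_su_oper_state_evh traces.
OPER_ENABLED = 1
OPER_DISABLED = 2


class ToySu:
    """Hypothetical stand-in for amfd's view of one SU (not the real class)."""

    def __init__(self, name):
        self.name = name
        self.readiness = Readiness.IN_SERVICE

    def on_oper_state(self, state):
        # Simplification: readiness follows the reported operational state.
        if state == OPER_DISABLED:
            self.readiness = Readiness.OUT_OF_SERVICE
        elif state == OPER_ENABLED:
            self.readiness = Readiness.IN_SERVICE


def on_quiesced_response(su):
    """Toy realign step: it only proceeds for an OUT_OF_SERVICE SU."""
    if su.readiness is Readiness.OUT_OF_SERVICE:
        return "continue failover"
    return "no action"  # the SG would stay in REALIGN, i.e. unstable


def run(oper_state_events):
    """Apply the oper state events, then process the quiesced response."""
    su = ToySu("SU5")
    for state in oper_state_events:
        su.on_oper_state(state)
    return on_quiesced_response(su)


# Expected ordering: only DISABLED arrives before the quiesced response.
print(run([OPER_DISABLED]))
# Ordering seen in the traces: ENABLED also arrives before the response.
print(run([OPER_DISABLED, OPER_ENABLED]))
```

The second call models the situation in this ticket: the ENABLED report flips the readiness state before the quiesced response is handled, so the toy realign step has nothing to do.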



---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.--
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets


[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery

2017-03-05 Thread Praveen
Attached are amfnd traces for the instantiation-failed state of the component 
in two cases: 1) before removal of assignments (osafamfnd_v1) and 2) after 
removal of assignments (osafamfnd_v2).


Attachments:

- [osafamfnd_v1](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/5dc5/attachment/osafamfnd_v1) (117.7 kB; application/octet-stream)
- [osafamfnd_v2](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/5dc5/attachment/osafamfnd_v2) (465.7 kB; application/octet-stream)


---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** review
**Milestone:** 5.0.2
**Labels:** unstable sg 
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Tue Feb 21, 2017 01:39 AM UTC
**Owner:** Minh Hon Chau




[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery

2017-02-20 Thread Minh Hon Chau
- Description has changed:

Diff:



--- old
+++ new
@@ -1,7 +1,7 @@
 This issue occurs as component failover recovery in context of locking node.
 
 **Configuration and steps:**
-1- Set up 2N model, PL4 hosts SU4, PL5 hosts SU5, PL3 hosts SU5B. Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 depends safSi=AmfDemoTwon
+1- Set up 2N model, PL4 hosts SU4, PL5 hosts SU5, PL3 hosts SU5B. 
 2- Bring up 2N app, SU4 has active assignment, SU5 has standby assignment
 3- Lock PL4
 4- Set a few seconds delay csi remove callback in component of SU4






---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** review
**Milestone:** 5.0.2
**Labels:** unstable sg 
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Mon Feb 20, 2017 12:59 AM UTC
**Owner:** Minh Hon Chau




[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery

2017-02-19 Thread Minh Hon Chau
Attaching patch V2 from Praveen.


Attachments:

- [2233_praveen.diff](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/aac2/attachment/2233_praveen.diff) (1.3 kB; text/x-patch)


---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** review
**Milestone:** 5.0.2
**Labels:** unstable sg 
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Mon Jan 23, 2017 01:30 AM UTC
**Owner:** Minh Hon Chau




[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery

2017-01-22 Thread Minh Hon Chau
- **status**: accepted --> review



---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** review
**Milestone:** 5.0.2
**Labels:** unstable sg 
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Wed Jan 18, 2017 12:20 AM UTC
**Owner:** Minh Hon Chau




[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery

2017-01-17 Thread Minh Hon Chau
Hi Praveen,

In ticket #1902, the problem of component failover during headless was found 
here: https://sourceforge.net/p/opensaf/tickets/1902/#8990

Outlined logs:

2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered 
su_si_assign msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
SI:'', ha_state:'1', msg_act:'5', single_csi:'0', error:'1', msg_id:'2'
2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered 
oper_state msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
su_oper_state:'2', node_oper_state:'1', recovery:'3'
2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered 
oper_state msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
su_oper_state:'1', node_oper_state:'1', recovery:'0'

After headless, two su_oper_msg messages are sent to amfd. The first one 
(recovery:'3') triggers the component failover sequence: amfd sends a QUIESCED 
su_si assignment to amfnd, and amfnd should then send the response to this 
QUIESCED su_si assignment back to amfd. The problem was that by the time amfd 
processed that response, the SU's readiness state at amfd had already been 
changed to IN_SERVICE by the second su_oper_msg (recovery:'0'). 
SG_2N::susi_success_sg_realign does not expect an IN_SERVICE SU in this 
situation.
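
The order of the buffered messages can be read straight out of the log lines. A minimal sketch, assuming the `key:'value'` layout seen in the outlined logs above; `parse_fields` and `oper_state_sequence` are hypothetical helper names, and the log lines are the ones quoted above with the mail line wraps joined.

```python
import re

# The three buffered messages quoted above, with the line wraps joined.
LOG = [
    "2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered "
    "su_si_assign msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', "
    "SI:'', ha_state:'1', msg_act:'5', single_csi:'0', error:'1', msg_id:'2'",
    "2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered "
    "oper_state msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', "
    "su_oper_state:'2', node_oper_state:'1', recovery:'3'",
    "2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered "
    "oper_state msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', "
    "su_oper_state:'1', node_oper_state:'1', recovery:'0'",
]

# Matches key:'value' pairs; values may be empty but contain no quote.
FIELD = re.compile(r"(\w+):'([^']*)'")


def parse_fields(line):
    """Return the key:'value' pairs of one log line as a dict of strings."""
    return dict(FIELD.findall(line))


def oper_state_sequence(lines):
    """Return (su_oper_state, recovery) pairs in log order."""
    sequence = []
    for line in lines:
        if "oper_state msg" in line:
            fields = parse_fields(line)
            sequence.append((int(fields["su_oper_state"]), int(fields["recovery"])))
    return sequence


print(oper_state_sequence(LOG))  # [(2, 3), (1, 0)]
```

The result shows the DISABLED report (recovery 3) followed directly by the ENABLED report (recovery 0), which is the pair of messages discussed above.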

The same problem can be seen in a non-headless scenario, which is the case in 
this ticket #2233. If amfd receives su_oper_msg (recovery:'3') and is then 
briefly busy with an RT update, the faulty component has enough time to 
instantiate and amfnd sends su_oper_msg (recovery:'0') early. We then see the 
same problem in SG_2N::susi_success_sg_realign as in #1902.

I think the component failover sequence between amfd and amfnd was basically 
not designed to handle su_oper_msg (recovery:'0'). This message can arrive at 
any unexpected point in the sequence.

Attached is a patch for this. It is modeled on the su failover scenario, which 
lets su_oper_msg (recovery:'0') arrive only after the su failover sequence (in 
the su repair phase). I'm still testing, but any comments are welcome.
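
The idea of letting the ENABLED report take effect only after the failover sequence can be sketched as follows. This is not the attached patch: `ToyFailoverTracker` and its methods are hypothetical, and the sketch does not say whether the deferral belongs in amfd or amfnd. It only models the intended ordering.

```python
# Operational state values as used in the su_oper_state messages.
OPER_ENABLED = 1
OPER_DISABLED = 2


class ToyFailoverTracker:
    """Hypothetical model; names and structure are not taken from the patch."""

    def __init__(self):
        self.readiness = "IN_SERVICE"
        self.failover_in_progress = False
        self.deferred_enable = False

    def on_oper_state(self, state):
        if state == OPER_DISABLED:
            # The DISABLED report starts the component failover sequence.
            self.readiness = "OUT_OF_SERVICE"
            self.failover_in_progress = True
        elif state == OPER_ENABLED:
            if self.failover_in_progress:
                # Hold the report back instead of flipping readiness mid-sequence.
                self.deferred_enable = True
            else:
                self.readiness = "IN_SERVICE"

    def on_failover_done(self):
        # The failover sequence finished; apply a held-back ENABLED report now.
        self.failover_in_progress = False
        if self.deferred_enable:
            self.deferred_enable = False
            self.readiness = "IN_SERVICE"


tracker = ToyFailoverTracker()
tracker.on_oper_state(OPER_DISABLED)
tracker.on_oper_state(OPER_ENABLED)
print(tracker.readiness)  # readiness while the failover sequence is running
tracker.on_failover_done()
print(tracker.readiness)  # readiness once the deferred report is applied
```

With this ordering the SU stays OUT_OF_SERVICE for the whole failover sequence, so a quiesced response handled in between would see the state the SG FSM expects.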

Thanks,
Minh


Attachments:

- [2233_V2.diff](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/cdb8/attachment/2233_V2.diff) (3.4 kB; text/x-patch)


---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** accepted
**Milestone:** 5.0.2
**Labels:** unstable sg 
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Fri Dec 23, 2016 02:10 AM UTC
**Owner:** Minh Hon Chau



[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery

2016-12-22 Thread Minh Hon Chau
- **status**: unassigned --> accepted
- **assigned_to**: Minh Hon Chau



---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** accepted
**Milestone:** 5.0.2
**Labels:** unstable sg 
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Tue Dec 20, 2016 03:04 AM UTC
**Owner:** Minh Hon Chau




[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery

2016-12-19 Thread Minh Hon Chau
2N models


Attachments:

- 
[app3_twon3su3si.xml](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/015b/attachment/app3_twon3su3si.xml)
 (14.5 kB; text/xml)


---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** unassigned
**Milestone:** 5.0.2
**Labels:** unstable sg 
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Tue Dec 20, 2016 03:03 AM UTC
**Owner:** nobody


This issue occurs during component failover recovery in the context of locking a node.

**Configuration and steps:**
1- Set up a 2N model: PL4 hosts SU4, PL5 hosts SU5, PL3 hosts SU5B. SI dependencies:
safSi=AmfDemoTwon2 depends on safSi=AmfDemoTwon1, which depends on safSi=AmfDemoTwon
2- Bring up the 2N app; SU4 has the active assignment, SU5 has the standby assignment
3- Lock PL4
4- Delay the csi remove callback in the component of SU4 by a few seconds
5- Delay the quiesced csi set callback in the component of SU5 by a few seconds
6- When SU5 finishes the active assignment, SU4 receives the assignment removal
from amfd. In the meantime, a component failover report is triggered by a
component of SU5.
7- SU5 now receives the quiesced csi set callback from amfd
8- Release both callbacks delayed in steps 4 and 5

**Observation:**
The SG is unstable: the failed SU (SU5) cannot be repaired, and no entity can be
locked or unlocked.

When amfd processes the quiesced assignment response in the REALIGN state, it
takes no action:
> Dec 20 13:23:22.272043 osafamfd [487:sg_2n_fsm.cc:1448] >> 
> susi_success_sg_realign: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' 
> act=5, state=3
> Dec 20 13:23:22.272048 osafamfd [487:sg.cc:1756] TR 
> safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon found in 
> safSg=AmfDemoTwon,safApp=AmfDemoTwon
> Dec 20 13:23:22.272054 osafamfd [487:sg_2n_fsm.cc:0477] >> 
> avd_sg_2n_act_susi: 'safSg=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272059 osafamfd [487:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwon,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272065 osafamfd [487:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272071 osafamfd [487:sg_2n_fsm.cc:0486] TR 
> si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon', 
> su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272076 osafamfd [487:sg_2n_fsm.cc:0501] TR 
> su_1'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', su_2'(null)'
> Dec 20 13:23:22.272082 osafamfd [487:sg_2n_fsm.cc:0555] << 
> avd_sg_2n_act_susi: act: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', 
> stdby: '(null)'
> Dec 20 13:23:22.272087 osafamfd [487:sg_2n_fsm.cc:1862] << 
> susi_success_sg_realign: rc:1

In this SG FSM function, SU5 is expected to be OUT_OF_SERVICE, but it is
currently IN_SERVICE.
SU5 is first reported as OUT_OF_SERVICE by the su_oper_state[DISABLED] message
sent as part of the component failover report:
> Dec 20 13:22:56.241508 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: 
> id:56, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:2

The failed component is then instantiated again and generates another message,
su_oper_state[ENABLED], which sets SU5 back to IN_SERVICE:
> Dec 20 13:22:58.481319 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: 
> id:62, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:1

SU5 should be OUT_OF_SERVICE while amfd orchestrates component failover
recovery, which starts with the QUIESCED assignment of SU5. If re-instantiation
of the failed component happens faster, as in this test, the SG FSM ends up in
an unexpected sequence.
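
To check this ordering in a longer amfd trace, the avd_su_oper_state_evh entries can be pulled out with a short script. The following is a rough sketch, not an OpenSAF tool: the function name `su_oper_state_events` and the regular expression are made up for illustration, the pattern is fitted only to the two trace entries quoted above (including the line wrap after the function name), and the mapping state:1 = ENABLED, state:2 = DISABLED is taken from the description in this ticket.

```python
import re

# Mapping as described in this ticket: state:2 is DISABLED, state:1 is ENABLED.
OPER_STATE = {1: "ENABLED", 2: "DISABLED"}

# Fitted to the trace entries quoted in this ticket; entries wrap after the
# function name, optionally with a "> " quote prefix on the continuation line.
EVENT_RE = re.compile(
    r"(?P<ts>\w{3} +\d+ [\d:.]+) osafamfd \[[^\]]+\] >> avd_su_oper_state_evh:"
    r"[\s>]+id:(?P<id>\d+), node:(?P<node>\w+), '(?P<su>[^']+)' "
    r"state:(?P<state>\d+)"
)


def su_oper_state_events(trace: str):
    """Return (timestamp, su_dn, oper_state_name) tuples in trace order."""
    events = []
    for m in EVENT_RE.finditer(trace):
        state = int(m.group("state"))
        name = OPER_STATE.get(state, f"UNKNOWN({state})")
        events.append((m.group("ts"), m.group("su"), name))
    return events


TRACE = """\
Dec 20 13:22:56.241508 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh:
id:56, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:2
Dec 20 13:22:58.481319 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh:
id:62, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:1
"""

for ts, su, state in su_oper_state_events(TRACE):
    print(ts, su, state)
```

Comparing the timestamps of these events with the 13:23:22 susi_success_sg_realign entry shows that SU5 was enabled again well before the quiesced response was processed.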


