[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery
- **status**: review --> fixed
- **assigned_to**: Minh Hon Chau --> nobody

---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** fixed
**Milestone:** 5.0.2
**Labels:** unstable sg
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Mon Mar 06, 2017 06:51 AM UTC
**Owner:** nobody

This issue occurs during component failover recovery in the context of locking a node.

**Configuration and steps:**
1- Set up a 2N model: PL4 hosts SU4, PL5 hosts SU5, PL3 hosts SU5B.
2- Bring up the 2N app; SU4 has the active assignment, SU5 has the standby assignment.
3- Lock PL4.
4- Set a delay of a few seconds in the csi remove callback of the component in SU4.
5- Set a delay of a few seconds in the quiesced csi set callback of the component in SU5.
6- When SU5 finishes the active assignment, SU4 receives the assignment removal from amfd. In the meantime, a component failover report is triggered by the component of SU5.
7- SU5 now receives the quiesced csi set callback from amfd.
8- Release both callbacks from steps 4 and 5.

**Observation:**
The SG is unstable: the failed SU (SU5) cannot be repaired, and no entity can be locked or unlocked.
When amfd processes the quiesced assignment response in REALIGN state, it takes no action:

> Dec 20 13:23:22.272043 osafamfd [487:sg_2n_fsm.cc:1448] >> susi_success_sg_realign: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' act=5, state=3
> Dec 20 13:23:22.272048 osafamfd [487:sg.cc:1756] TR safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon found in safSg=AmfDemoTwon,safApp=AmfDemoTwon
> Dec 20 13:23:22.272054 osafamfd [487:sg_2n_fsm.cc:0477] >> avd_sg_2n_act_susi: 'safSg=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272059 osafamfd [487:sg_2n_fsm.cc:0486] TR si'safSi=AmfDemoTwon,safApp=AmfDemoTwon', su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', si'safSi=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272065 osafamfd [487:sg_2n_fsm.cc:0486] TR si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon', su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272071 osafamfd [487:sg_2n_fsm.cc:0486] TR si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon', su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272076 osafamfd [487:sg_2n_fsm.cc:0501] TR su_1'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', su_2'(null)'
> Dec 20 13:23:22.272082 osafamfd [487:sg_2n_fsm.cc:0555] << avd_sg_2n_act_susi: act: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', stdby: '(null)'
> Dec 20 13:23:22.272087 osafamfd [487:sg_2n_fsm.cc:1862] << susi_success_sg_realign: rc:1

In this SG FSM function, SU5 is expected to be OUT_OF_SERVICE, but SU5 is currently IN_SERVICE.

SU5 is first reported as OUT_OF_SERVICE by the su_oper_state[DISABLED] message sent as part of the component failover report:

Dec 20 13:22:56.241508 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: id:56, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:2

The failed component is then instantiated again and generates another message, su_oper_state[ENABLED], which sets SU5 back to IN_SERVICE:

Dec 20 13:22:58.481319 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: id:62, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:1

SU5 should be OUT_OF_SERVICE while amfd orchestrates component failover recovery, which starts with the QUIESCED assignment of SU5. If re-instantiation of the failed component completes sooner, as in this test, the SG FSM sees an unexpected sequence.

---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
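The ordering problem described in the ticket can be illustrated with a toy model. This is only a sketch under stated assumptions, not amfd code: the `ToySg` class, its method names, and the event names are hypothetical. The only behavior taken from the ticket is that the QUIESCED response is acted on while the SU is OUT_OF_SERVICE, and that an early su_oper_state[ENABLED] puts the SU back to IN_SERVICE.

```python
from dataclasses import dataclass, field


@dataclass
class ToySg:
    """Hypothetical stand-in for a 2N SG state machine (not the amfd implementation)."""
    su_readiness: str = "IN_SERVICE"
    fsm_state: str = "STABLE"
    actions: list = field(default_factory=list)

    def on_su_oper_state(self, enabled: bool) -> None:
        if enabled:
            # A re-instantiated component reports ENABLED: SU goes back in service.
            self.su_readiness = "IN_SERVICE"
        else:
            # Component failover report: SU goes out of service, failover starts.
            self.su_readiness = "OUT_OF_SERVICE"
            self.fsm_state = "REALIGN"
            self.actions.append("send QUIESCED")

    def on_quiesced_response(self) -> None:
        # Assumption from the ticket: the handler only proceeds for an
        # OUT_OF_SERVICE SU. Otherwise it does nothing and the SG stays in REALIGN.
        if self.fsm_state == "REALIGN" and self.su_readiness == "OUT_OF_SERVICE":
            self.actions.append("continue failover")
            self.fsm_state = "STABLE"


def replay(events):
    """Feed a list of hypothetical event names into a fresh ToySg."""
    sg = ToySg()
    for name in events:
        if name == "oper_disabled":
            sg.on_su_oper_state(enabled=False)
        elif name == "oper_enabled":
            sg.on_su_oper_state(enabled=True)
        elif name == "quiesced_done":
            sg.on_quiesced_response()
        else:
            raise ValueError(f"unknown event: {name}")
    return sg


expected = replay(["oper_disabled", "quiesced_done", "oper_enabled"])
racy = replay(["oper_disabled", "oper_enabled", "quiesced_done"])
print(expected.fsm_state, racy.fsm_state)
```

In the second ordering the ENABLED report arrives before the QUIESCED response, so the toy SG never leaves REALIGN, which mirrors the "no action from amfd" observation.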
[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery
Attached are amfnd traces related to the instantiation-failed state of the comp in two cases: 1) before removal of assignments (osafamfnd_v1), and 2) after removal of assignments (osafamfnd_v2).

Attachments:

- [osafamfnd_v1](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/5dc5/attachment/osafamfnd_v1) (117.7 kB; application/octet-stream)
- [osafamfnd_v2](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/5dc5/attachment/osafamfnd_v2) (465.7 kB; application/octet-stream)

---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** review
**Milestone:** 5.0.2
**Labels:** unstable sg
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Tue Feb 21, 2017 01:39 AM UTC
**Owner:** Minh Hon Chau
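When reading through traces like these, it can help to pull out only the SU operational state events and their timestamps. The sketch below assumes the trace files use the same line layout as the amfd excerpts quoted in the ticket description; that layout is inferred from those excerpts, and the helper name `su_oper_events` is hypothetical.

```python
import re

# Layout inferred from the excerpts quoted in the ticket:
# "<Mon> <day> <hh:mm:ss.usec> <process> [<pid>:<file>:<line>] <marker> <message>"
TRACE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+) (?P<proc>\S+) "
    r"\[(?P<pid>\d+):(?P<file>[^:\]]+):(?P<line>\d+)\] "
    r"(?P<marker>\S+) (?P<msg>.*)$"
)
OPER_RE = re.compile(r"avd_su_oper_state_evh: .*'(?P<su>[^']+)' state:(?P<state>\d+)")

# In the ticket, state:2 accompanies su_oper_state[DISABLED] and state:1 ENABLED.
OPER_NAMES = {1: "ENABLED", 2: "DISABLED"}


def su_oper_events(lines):
    """Return (timestamp, su, oper_state) for each avd_su_oper_state_evh entry."""
    events = []
    for raw in lines:
        m = TRACE_RE.match(raw.strip())
        if not m:
            continue  # not a trace line in the assumed layout
        o = OPER_RE.search(m.group("msg"))
        if o:
            state = int(o.group("state"))
            events.append((m.group("ts"), o.group("su"), OPER_NAMES.get(state, str(state))))
    return events


SAMPLE = [
    "Dec 20 13:22:56.241508 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: "
    "id:56, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:2",
    "Dec 20 13:22:58.481319 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: "
    "id:62, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:1",
    "Dec 20 13:23:22.272087 osafamfd [487:sg_2n_fsm.cc:1862] << susi_success_sg_realign: rc:1",
]

events = su_oper_events(SAMPLE)
for ts, su, state in events:
    print(ts, su, state)
```

Comparing these timestamps with the time of the `susi_success_sg_realign` entry shows whether the ENABLED report arrived before the QUIESCED response was processed.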
[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery
- Description has changed:

Diff:

--- old
+++ new
@@ -1,7 +1,7 @@
 This issue occurs as component failover recovery in context of locking node.
 
 **Configuration and steps:**
-1- Set up 2N model, PL4 hosts SU4, PL5 hosts SU5, PL3 hosts SU5B. Si deps safSi=AmfDemoTwon2 depends safSi=AmfDemoTwon1 depends safSi=AmfDemoTwon
+1- Set up 2N model, PL4 hosts SU4, PL5 hosts SU5, PL3 hosts SU5B.
 2- Bring up 2N app, SU4 has active assignment, SU5 has standby assignment
 3- Lock PL4
 4- Set a few seconds delay csi remove callback in component of SU4

---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** review
**Milestone:** 5.0.2
**Labels:** unstable sg
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Mon Feb 20, 2017 12:59 AM UTC
**Owner:** Minh Hon Chau
[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery
Attach patch V2 from Praveen

Attachments:

- [2233_praveen.diff](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/aac2/attachment/2233_praveen.diff) (1.3 kB; text/x-patch)

---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** review
**Milestone:** 5.0.2
**Labels:** unstable sg
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Mon Jan 23, 2017 01:30 AM UTC
**Owner:** Minh Hon Chau
[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery
- **status**: accepted --> review

---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** review
**Milestone:** 5.0.2
**Labels:** unstable sg
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Wed Jan 18, 2017 12:20 AM UTC
**Owner:** Minh Hon Chau
[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery
Hi Praveen,

In ticket #1902, the problem of component failover during headless was found here: https://sourceforge.net/p/opensaf/tickets/1902/#8990

Outlined logs:

2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered su_si_assign msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', SI:'', ha_state:'1', msg_act:'5', single_csi:'0', error:'1', msg_id:'2'
2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered oper_state msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', su_oper_state:'2', node_oper_state:'1', recovery:'3'
2016-12-19 10:54:20 PL-5 osafamfnd[416]: NO Found and resend buffered oper_state msg for SU:'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', su_oper_state:'1', node_oper_state:'1', recovery:'0'

After headless, two su_oper_msg messages are sent to amfd. The first one (recovery:'3') triggers the component failover sequence, so amfd sends a QUIESCED su_si assignment to amfnd, and amfnd should then send the response to this QUIESCED su_si assignment back to amfd. The problem was that, by the time amfd processed the response to this QUIESCED su_si assignment, the SU's readiness state at amfd had already changed to IN_SERVICE because of the second su_oper_msg (recovery:'0'). SG_2N::susi_success_sg_realign does not expect an IN_SERVICE SU in this situation.

The same problem can be seen in a non-headless scenario, which is this ticket, #2233. If amfd receives su_oper_msg (recovery:'3') and is then busy for a while with RT updates, the faulty component has enough time to instantiate and amfnd sends su_oper_msg (recovery:'0') earlier. We then see the same problem in SG_2N::susi_success_sg_realign as in #1902.

I think the entire component failover sequence between amfd and amfnd is basically not designed to include su_oper_msg (recovery:'0'). This message can arrive at any unexpected point in the sequence.

Attached is a patch for this. It is based on the similarity with the su failover scenario, which lets su_oper_msg (recovery:'0') come after the su failover sequence (in the su repair phase). I'm still testing, but any comments are welcome.

Thanks,
Minh

Attachments:

- [2233_V2.diff](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/cdb8/attachment/2233_V2.diff) (3.4 kB; text/x-patch)

---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** accepted
**Milestone:** 5.0.2
**Labels:** unstable sg
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Fri Dec 23, 2016 02:10 AM UTC
**Owner:** Minh Hon Chau
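One way to read the idea in the comment above is that the su_oper_msg with recovery:'0' is held back until the failover sequence has finished. The following is a minimal sketch of that reading only; it does not reproduce the attached patch, and it is not stated here whether the real change sits in amfd or amfnd. The class `DeferringSg`, its fields, and the log strings are all hypothetical.

```python
from collections import deque


class DeferringSg:
    """Hypothetical model: queue an ENABLED report that arrives mid-failover."""

    def __init__(self):
        self.readiness = "IN_SERVICE"
        self.failover_in_progress = False
        self.deferred = deque()
        self.log = []

    def on_su_oper_state(self, state, recovery):
        if state == "DISABLED" and recovery == 3:
            # recovery:'3' starts the component failover sequence (per the comment).
            self.readiness = "OUT_OF_SERVICE"
            self.failover_in_progress = True
            self.log.append("QUIESCED sent")
        elif state == "ENABLED":
            if self.failover_in_progress:
                # Assumption: hold the message instead of flipping readiness now.
                self.deferred.append((state, recovery))
            else:
                self.readiness = "IN_SERVICE"
                self.log.append("SU repaired")

    def on_quiesced_response(self):
        if self.readiness != "OUT_OF_SERVICE":
            self.log.append("response ignored")  # the stuck case from the ticket
            return
        self.log.append("failover completed")
        self.failover_in_progress = False
        # Replay anything that arrived during the failover sequence.
        while self.deferred:
            self.on_su_oper_state(*self.deferred.popleft())


sg = DeferringSg()
sg.on_su_oper_state("DISABLED", 3)  # component failover report
sg.on_su_oper_state("ENABLED", 0)   # early re-instantiation report, deferred
readiness_during_failover = sg.readiness
sg.on_quiesced_response()           # QUIESCED response now sees OUT_OF_SERVICE
print(sg.log, sg.readiness)
```

With the ENABLED report deferred, the QUIESCED response is handled while the SU is still OUT_OF_SERVICE, and the repair step runs afterwards.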
[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery
- **status**: unassigned --> accepted
- **assigned_to**: Minh Hon Chau

---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** accepted
**Milestone:** 5.0.2
**Labels:** unstable sg
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Tue Dec 20, 2016 03:04 AM UTC
**Owner:** Minh Hon Chau
[tickets] [opensaf:tickets] #2233 AMF: SG is unstable after component failover recovery
2N models

Attachments:

- [app3_twon3su3si.xml](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/fcada78e/015b/attachment/app3_twon3su3si.xml) (14.5 kB; text/xml)

---

** [tickets:#2233] AMF: SG is unstable after component failover recovery**

**Status:** unassigned
**Milestone:** 5.0.2
**Labels:** unstable sg
**Created:** Tue Dec 20, 2016 03:00 AM UTC by Minh Hon Chau
**Last Updated:** Tue Dec 20, 2016 03:03 AM UTC
**Owner:** nobody

This issue occurs during component failover recovery in the context of locking a node.

**Configuration and steps:**
1- Set up a 2N model: PL4 hosts SU4, PL5 hosts SU5, PL3 hosts SU5B. SI dependencies: safSi=AmfDemoTwon2 depends on safSi=AmfDemoTwon1, which depends on safSi=AmfDemoTwon
2- Bring up the 2N app; SU4 has the active assignment, SU5 has the standby assignment
3- Lock PL4
4- Add a delay of a few seconds to the CSI remove callback in the component of SU4
5- Add a delay of a few seconds to the quiesced CSI set callback in the component of SU5
6- When SU5 finishes the active assignment, SU4 receives the assignment removal from amfd. In the meantime, a component failover report is triggered by the component of SU5.
7- SU5 now receives the quiesced CSI set callback from amfd
8- Release both callbacks from steps 4 and 5

**Observation:**
The SG is unstable; the failed SU (SU5) cannot be repaired, and no entity can be locked or unlocked.

When amfd processes the quiesced assignment response in the REALIGN state, it takes no action:

> Dec 20 13:23:22.272043 osafamfd [487:sg_2n_fsm.cc:1448] >> susi_success_sg_realign: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' act=5, state=3
> Dec 20 13:23:22.272048 osafamfd [487:sg.cc:1756] TR safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon found in safSg=AmfDemoTwon,safApp=AmfDemoTwon
> Dec 20 13:23:22.272054 osafamfd [487:sg_2n_fsm.cc:0477] >> avd_sg_2n_act_susi: 'safSg=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272059 osafamfd [487:sg_2n_fsm.cc:0486] TR si'safSi=AmfDemoTwon,safApp=AmfDemoTwon', su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', si'safSi=AmfDemoTwon,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272065 osafamfd [487:sg_2n_fsm.cc:0486] TR si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon', su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', si'safSi=AmfDemoTwonDep1,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272071 osafamfd [487:sg_2n_fsm.cc:0486] TR si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon', su'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', si'safSi=AmfDemoTwonDep2,safApp=AmfDemoTwon'
> Dec 20 13:23:22.272076 osafamfd [487:sg_2n_fsm.cc:0501] TR su_1'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', su_2'(null)'
> Dec 20 13:23:22.272082 osafamfd [487:sg_2n_fsm.cc:0555] << avd_sg_2n_act_susi: act: 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon', stdby: '(null)'
> Dec 20 13:23:22.272087 osafamfd [487:sg_2n_fsm.cc:1862] << susi_success_sg_realign: rc:1

In this SG FSM function, SU5 is expected to be OUT_OF_SERVICE, but it is currently IN_SERVICE.

SU5 is first set to OUT_OF_SERVICE by the su_oper_state[DISABLED] message sent as part of the component failover report:

Dec 20 13:22:56.241508 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: id:56, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:2

The failed component is then instantiated again, which generates another message, su_oper_state[ENABLED], and this sets SU5 back to IN_SERVICE:

Dec 20 13:22:58.481319 osafamfd [487:sgproc.cc:0656] >> avd_su_oper_state_evh: id:62, node:2050f, 'safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon' state:1

SU5 should be OUT_OF_SERVICE while amfd orchestrates component failover recovery, which starts with the QUIESCED assignment of SU5. If re-instantiation of the failed component completes before the quiesced assignment response is processed, as in this test, the SG FSM ends up in an unexpected sequence.
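To check the order of the oper-state messages in a longer amfd trace, the `avd_su_oper_state_evh` entries can be pulled out with a small helper. This is only a sketch: `OperStateEvent` and `parse_oper_state_event` are hypothetical names, and the line layout is assumed from the two trace entries quoted in this ticket, so other amfd versions or trace settings may need a different pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical record for one avd_su_oper_state_evh trace entry.
struct OperStateEvent {
  std::string time;  // e.g. "13:22:56.241508"
  std::string su;    // DN of the SU, taken from between the single quotes
  int state;         // 1 = ENABLED, 2 = DISABLED, per the entries in this ticket
};

// Parses a single trace line. Returns false if the line is not an
// avd_su_oper_state_evh entry in the assumed layout.
bool parse_oper_state_event(const std::string& line, OperStateEvent& out) {
  static const std::regex pattern(
      R"((\d{2}:\d{2}:\d{2}\.\d+).*avd_su_oper_state_evh:.*'([^']+)' state:(\d+))");
  std::smatch m;
  if (!std::regex_search(line, m, pattern)) {
    return false;
  }
  out.time = m[1].str();
  out.su = m[2].str();
  out.state = std::stoi(m[3].str());
  return true;
}
```

Applied to the two entries above, this yields state 2 at 13:22:56.241508 followed by state 1 at 13:22:58.481319 for SU5. Both precede the quiesced response handled at 13:23:22.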