---

** [tickets:#2161] AMF: Unexpected assignment due to headless recovery steps**

**Status:** assigned
**Milestone:** 5.2.FC
**Labels:** headless recovery 
**Created:** Thu Nov 03, 2016 02:44 AM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 03, 2016 02:44 AM UTC
**Owner:** Minh Hon Chau
**Attachments:**

- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2161/attachment/log.tgz) 
(1.2 MB; application/x-compressed)


An si-swap test case, as follows:
- Issue the si-swap command; PL3 hosts SU1 with the Active assignment, PL4 hosts 
SU2 with the Standby assignment
- Delay the QUIESCED csiSet callback on SU1
- Stop both SCs
- Release the QUIESCED csiSet callback on SU1
- Stop PL4
- Restart the SCs
- Observation: SU1 gets a csiRemove callback.

Below is a snippet of the AMFD trace where the problem happens:
Nov  3 12:13:49.902975 osafamfd [475:sgproc.cc:1045] >> avd_su_si_assign_evh: 
id:2, node:2030f, act:5, 'safSu=1,safSg=1,safApp=osaftest', '', ha:3, err:1, 
single:0
Nov  3 12:13:49.903450 osafamfd [475:sg_2n_fsm.cc:2354] >> susi_success: 
'safSu=1,safSg=1,safApp=osaftest' act=5, hastate=3, TEST sg_fsm_state=2
Nov  3 12:13:49.903455 osafamfd [475:sg_2n_fsm.cc:1873] >> 
susi_success_su_oper: 'safSu=1,safSg=1,safApp=osaftest' act=5, state=3
Nov  3 12:13:49.903459 osafamfd [475:sg_2n_fsm.cc:0477] >> avd_sg_2n_act_susi: 
'safSg=1,safApp=osaftest'
...
Nov  3 12:13:49.903605 osafamfd [475:sg_2n_fsm.cc:0555] << avd_sg_2n_act_susi: 
act: 'safSu=1,safSg=1,safApp=osaftest', stdby: 'safSu=2,safSg=1,safApp=osaftest'
Nov  3 12:13:49.903610 osafamfd [475:sgproc.cc:2358] >> avd_sg_su_si_del_snd: 
'safSu=1,safSg=1,safApp=osaftest'
...
Nov  3 12:13:49.904102 osafamfd [475:sgproc.cc:2396] << avd_sg_su_si_del_snd 
Nov  3 12:13:49.904107 osafamfd [475:sg.cc:1693] >> set_fsm_state 
Nov  3 12:13:49.904112 osafamfd [475:sg.cc:1696] TR safSg=1,safApp=osaftest 
sg_fsm_state 2 => 1

When AMFD receives the assignment response, SG_2N::susi_success_su_oper is called. 
In the normal scenario, the next step is for SU2 to take over the Active 
assignment, because SU2 is IN-SERVICE.
In the headless recovery scenario, SU2 has an absent SUSI, so SU2 is 
OUT-OF-SERVICE and cannot take the Active assignment. 
The root cause of this problem is that when an SU becomes OUT-OF-SERVICE 
due to a node reboot, the situation should be handled in node_fail_su_oper, where 
the assignment of the OUT-OF-SERVICE SU is removed. In other words, 
susi_success_su_oper is not supposed to handle an SU that is OUT-OF-SERVICE but 
still has an assignment. When susi_success_su_oper() is called, any SU that has 
an assignment must be IN-SERVICE.
The original cause lies in headless recovery, which currently prioritizes 
pending assignments over absent assignments; this can be seen in cluster.cc:

cluster.cc
                if (i_sg->any_assignment_in_progress() == false) {
                        if (i_sg->any_assignment_absent() == false) {
                                i_sg->set_fsm_state(AVD_SG_FSM_STABLE);
                        } else {
                                // failover with ABSENT SUSI, which had already been removed during headless
                                i_sg->failover_absent_assignment();
                        }
                }

This works in most SG FSM states; however, there is a problem in SG_FSM_SU_OPER, 
where the assignment needs to be moved over to another SU.
A solution is for AMFD to reverse the order of headless recovery so that it 
fails over absent assignments first. This change should also work, because the 
same sequence is already supported in the non-headless case, where a node 
restarts while an assignment on another node is still in progress.
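
The difference between the two orderings can be sketched with a toy model. This is only an illustration, not the actual patch: `MockSg`, `recover_current`, `recover_proposed` and the `action` strings are hypothetical names, and only the method names any_assignment_in_progress, any_assignment_absent and failover_absent_assignment come from the cluster.cc snippet above. The exact shape of the proposed branch (including when the SG is set back to STABLE) is an assumption.

```cpp
#include <string>

// Hypothetical stand-in for a service group; NOT the real AMFD SG class.
struct MockSg {
  bool in_progress;    // an assignment response is still pending
  bool absent;         // at least one SUSI was removed while headless
  std::string action;  // records which recovery step was taken

  MockSg(bool pending, bool has_absent)
      : in_progress(pending), absent(has_absent), action("none") {}

  bool any_assignment_in_progress() const { return in_progress; }
  bool any_assignment_absent() const { return absent; }
  void set_fsm_stable() { action = "stable"; }
  void failover_absent_assignment() {
    action = "failover_absent";
    absent = false;
  }
};

// Current ordering, mirroring the quoted cluster.cc snippet:
// a pending assignment blocks the failover of absent assignments.
void recover_current(MockSg& sg) {
  if (!sg.any_assignment_in_progress()) {
    if (!sg.any_assignment_absent()) {
      sg.set_fsm_stable();
    } else {
      sg.failover_absent_assignment();
    }
  }
}

// Proposed ordering (assumed shape): absent assignments are failed over
// first, whether or not another assignment is still in progress.
void recover_proposed(MockSg& sg) {
  if (sg.any_assignment_absent()) {
    sg.failover_absent_assignment();
  } else if (!sg.any_assignment_in_progress()) {
    sg.set_fsm_stable();
  }
}
```

In the failing scenario of this ticket, both flags are set: SU1 still has a pending QUIESCED response and SU2 has an absent SUSI. With `MockSg sg(true, true)`, `recover_current` leaves `sg.action` at "none", so the absent assignment is still there when the response arrives, while `recover_proposed` takes the failover branch first.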




---
