In the normal case, the up flag is reset by the MDS thread whenever amfnd
receives a NEW_ACTIVE event:
~~~
case NCSMDS_NEW_ACTIVE:
if (evt_info->i_svc_id == NCSMDS_SVC_ID_AVD) {
LOG_NO("AVD NEW_ACTIVE, adest:%" PRIu64, evt_info->i_dest);
// sometimes NEW_ACTIVE director is received before
// DOWN is received for the old director ..
if (m_AVND_CB_IS_AVD_UP(cb)) {
m_AVND_CB_AVD_UP_RESET(cb);
}
~~~
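The flag is set back on the main-thread side when the queued AVND_EVT_MDS_AVD_UP event is processed; if the handler finds the flag already set, amfnd concludes a second active director is up and reboots the node (the "AVD already up" reason in the log below). A minimal sketch of that handler logic, with invented names rather than the real OpenSAF code:

```cpp
#include <cstdio>

// Illustrative sketch only: "Cb" and handle_mds_avd_up() are invented
// names; only the flag semantics follow the description in this ticket.
struct Cb {
  bool avd_up = false;  // the shared "AVD up" flag
};

// Main thread: handler for AVND_EVT_MDS_AVD_UP.
// Returns true when a duplicate active AVD is detected.
bool handle_mds_avd_up(Cb *cb) {
  if (cb->avd_up) {
    // The flag was never reset in between: a second active director
    // was seen, so the real amfnd reboots with "AVD already up".
    std::puts("AVD already up");
    return true;
  }
  cb->avd_up = true;  // normal case: mark the new active AVD as up
  return false;
}
```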
That flag is also set by the amfnd main thread when amfnd processes the
AVND_EVT_MDS_AVD_UP event. The up flag is not protected by any lock, so access
to it is not thread-safe. This is what happened when the issue occurred:
Amfnd main thread | Amfnd mds thread | Up flag
---------- | ---------- | ------
| Receive NEW_ACTIVE event | True
| Reset up flag, and send AVND_EVT_MDS_AVD_UP event | False
| Receive NEW_ACTIVE event | False
| Reset up flag, and send AVND_EVT_MDS_AVD_UP event | False
Receive AVND_EVT_MDS_AVD_UP | | False
Process AVND_EVT_MDS_AVD_UP | | True
Receive AVND_EVT_MDS_AVD_UP | | True
Process AVND_EVT_MDS_AVD_UP | | True
And this is the normal case:
Amfnd main thread | Amfnd mds thread | Up flag
---------- | ---------- | ------
| Receive NEW_ACTIVE event | True
| Reset up flag, and send AVND_EVT_MDS_AVD_UP event | False
Receive AVND_EVT_MDS_AVD_UP | | False
Process AVND_EVT_MDS_AVD_UP | | True
| Receive NEW_ACTIVE event | True
| Reset up flag, and send AVND_EVT_MDS_AVD_UP event | False
Receive AVND_EVT_MDS_AVD_UP | | False
Process AVND_EVT_MDS_AVD_UP | | True
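The two interleavings above can be replayed deterministically by sequentializing the thread steps. A small sketch with illustrative names (not the actual OpenSAF symbols): in the failure ordering, both NEW_ACTIVE events land before the main thread runs, so the second AVD_UP event finds the flag already set.

```cpp
#include <cassert>
#include <queue>

// Model of the shared, unprotected flag and the amfnd event queue.
// All names here are invented for the sketch.
struct Model {
  bool avd_up = true;             // an AVD was up before the NEW_ACTIVE burst
  std::queue<int> events;         // queued AVND_EVT_MDS_AVD_UP events
  bool duplicate_active = false;  // set when "AVD already up" is hit

  // MDS thread step: NCSMDS_NEW_ACTIVE -> reset the flag, queue AVD_UP.
  void mds_new_active() {
    avd_up = false;
    events.push(1);
  }

  // Main thread step: dequeue and process one AVND_EVT_MDS_AVD_UP.
  void main_process_avd_up() {
    events.pop();
    if (avd_up) duplicate_active = true;  // real amfnd reboots the node here
    avd_up = true;
  }
};
```

Replaying the failure ordering (two MDS steps, then two main-thread steps) sets `duplicate_active`; alternating the steps, as in the normal case, does not.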
---
** [tickets:#3317] amfnd: two NEW_ACTIVE amfd in split-brain scenario**
**Status:** accepted
**Milestone:** 5.22.11
**Created:** Fri Jun 10, 2022 04:00 AM UTC by Hieu Hong Hoang
**Last Updated:** Fri Jun 10, 2022 04:00 AM UTC
**Owner:** Hieu Hong Hoang
This issue happened when we tested the system in a split-brain scenario. We
split the cluster into partitions as follows: [[SC-1(ACT), SC-2(STB), SC-3],
[SC-4(ACT), SC-5(STB), SC-6], [SC-7, SC-8, SC-9(STB), SC-10(ACT)]], then
merged all the nodes back. The quiesced SC-3 saw the active nodes in the other
partitions come up while the active SC-1 in its own partition was still alive,
so no service events were raised for the active nodes in the other partitions.
When SC-1 went down, one of the other active SCs was notified as the new
active. After that new active SC went down, the other active SC was notified.
Finally, amfnd on SC-3 detected two NEW_ACTIVE amfd and rebooted the node.
Log analysis:
* SC-3 detected the active amfd in the other partitions coming up:
~~~
<143>1 2022-05-31T05:34:56.169467+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25315"] >> mds_mcm_svc_up
<143>1 2022-05-31T05:34:56.169469+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25316"] MCM:API: LOCAL SVC INFO : svc_id = AVND(13) | PWE id = 1 |
VDEST id = 65535 |
<143>1 2022-05-31T05:34:56.16947+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25317"] MCM:API: REMOTE SVC INFO : svc_id = AVD(12) | PWE id = 1 |
VDEST id = 1 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2022-05-31T05:34:56.169472+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25318"] >> mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.169474+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25319"] << mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.169476+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25320"] >> mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.169477+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25321"] << mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.169479+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25322"] >> mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.16948+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25323"] << mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.169482+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25324"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.169484+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25325"] MDS:DB: Subscription Result not present
<143>1 2022-05-31T05:34:56.169486+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25326"] << mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.169487+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25327"] >> mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.169489+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25328"] << mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.169491+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25329"] >> mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.169493+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25330"] MDS:DB: adest_details: <rem_node[0x20a0f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.169494+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25331"] << get_subtn_adest_details
<143>1 2022-05-31T05:34:56.169496+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25332"] MDS:DB: sub_adest_details: <rem_node[0x20a0f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.169498+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25333"] << mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.169499+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25334"] << mds_mcm_svc_up
...
<143>1 2022-05-31T05:34:56.175867+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25497"] >> mds_mcm_svc_up
<143>1 2022-05-31T05:34:56.175869+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25498"] MCM:API: LOCAL SVC INFO : svc_id = AVND(13) | PWE id = 1 |
VDEST id = 65535 |
<143>1 2022-05-31T05:34:56.17587+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25499"] MCM:API: REMOTE SVC INFO : svc_id = AVD(12) | PWE id = 1 |
VDEST id = 1 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2022-05-31T05:34:56.175872+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25500"] >> mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.175874+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25501"] << mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.175875+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25502"] >> mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.175877+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25503"] << mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.175879+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25504"] >> mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.175881+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25505"] << mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.175882+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25506"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.175885+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25507"] MDS:DB: Subscription Result not present
<143>1 2022-05-31T05:34:56.175887+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25508"] << mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.175888+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25509"] >> mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.17589+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25510"] << mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.175891+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25511"] >> mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.175893+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25512"] MDS:DB: adest_details: <rem_node[0x2040f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.175895+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25513"] << get_subtn_adest_details
<143>1 2022-05-31T05:34:56.175897+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25514"] MDS:DB: sub_adest_details: <rem_node[0x2040f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.175898+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25515"] << mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.1759+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25516"] << mds_mcm_svc_up
~~~
* SC-1 went down and SC-4 was notified as new active:
~~~
2022-05-31T05:34:56.214424+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25653"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2010f]:dest_pid[466]>, rem_svc_pvt_ver=7
...
2022-05-31T05:34:56.21448+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25662"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
~~~
* SC-4 went down and SC-10 was notified as new active:
~~~
2022-05-31T05:34:56.214606+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25731"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
...
2022-05-31T05:34:56.214626+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25740"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x20a0f]:dest_pid[441]>, rem_svc_pvt_ver=7
~~~
* SC-3 rebooted because it detected two active amfd:
~~~
2022-05-23 14:41:16.878 SC-3 osafamfnd[454]: Rebooting OpenSAF NodeId = 2030f
EE Name = , Reason: AVD already up, OwnNodeId = 2030f, SupervisionTime = 60
2022-05-23 14:41:16.890 SC-3 opensaf_reboot: Rebooting local node; timeout=60
~~~