In the normal case, the up flag is reset by the MDS thread whenever amfnd
receives a NEW_ACTIVE event:
~~~
case NCSMDS_NEW_ACTIVE:
if (evt_info->i_svc_id == NCSMDS_SVC_ID_AVD) {
LOG_NO("AVD NEW_ACTIVE, adest:%" PRIu64, evt_info->i_dest);
// sometimes NEW_ACTIVE director is received before
// DOWN is received for the old director ..
if (m_AVND_CB_IS_AVD_UP(cb)) {
m_AVND_CB_AVD_UP_RESET(cb);
}
~~~
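The flag is set back on the main-thread side when the queued AVND_EVT_MDS_AVD_UP event is processed; if the handler finds the flag already set, amfnd concludes a second active director is up and reboots the node (the "AVD already up" reason in the log below). A minimal sketch of that handler logic, with invented names rather than the real OpenSAF code:

```cpp
#include <cstdio>

// Illustrative sketch only: "Cb" and handle_mds_avd_up() are invented
// names; only the flag semantics follow the description in this ticket.
struct Cb {
  bool avd_up = false;  // the shared "AVD up" flag
};

// Main thread: handler for AVND_EVT_MDS_AVD_UP.
// Returns true when a duplicate active AVD is detected.
bool handle_mds_avd_up(Cb *cb) {
  if (cb->avd_up) {
    // The flag was never reset in between: a second active director
    // was seen, so the real amfnd reboots with "AVD already up".
    std::puts("AVD already up");
    return true;
  }
  cb->avd_up = true;  // normal case: mark the new active AVD as up
  return false;
}
```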
That flag is also set by the amfnd main thread when amfnd processes the
AVND_EVT_MDS_AVD_UP event. The up flag is not protected by any lock, so access
to it is not thread-safe. This is what happened when the issue occurred:
Amfnd main thread | Amfnd mds thread | Up flag
---------- | ---------- | ------
| Receive NEW_ACTIVE event | True
| Reset up flag, and send AVND_EVT_MDS_AVD_UP event | False
| Receive NEW_ACTIVE event | False
| Reset up flag, and send AVND_EVT_MDS_AVD_UP event | False
Receive AVND_EVT_MDS_AVD_UP | | False
Process AVND_EVT_MDS_AVD_UP | | True
Receive AVND_EVT_MDS_AVD_UP | | True
Process AVND_EVT_MDS_AVD_UP | | True
And this is the normal case:
Amfnd main thread | Amfnd mds thread | Up flag
---------- | ---------- | ------
| Receive NEW_ACTIVE event | True
| Reset up flag, and send AVND_EVT_MDS_AVD_UP event | False
Receive AVND_EVT_MDS_AVD_UP | | False
Process AVND_EVT_MDS_AVD_UP | | True
| Receive NEW_ACTIVE event | True
| Reset up flag, and send AVND_EVT_MDS_AVD_UP event | False
Receive AVND_EVT_MDS_AVD_UP | | False
Process AVND_EVT_MDS_AVD_UP | | True
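The two interleavings above can be replayed deterministically by sequentializing the thread steps. A small sketch with illustrative names (not the actual OpenSAF symbols): in the failure ordering, both NEW_ACTIVE events land before the main thread runs, so the second AVD_UP event finds the flag already set.

```cpp
#include <cassert>
#include <queue>

// Model of the shared, unprotected flag and the amfnd event queue.
// All names here are invented for the sketch.
struct Model {
  bool avd_up = true;             // an AVD was up before the NEW_ACTIVE burst
  std::queue<int> events;         // queued AVND_EVT_MDS_AVD_UP events
  bool duplicate_active = false;  // set when "AVD already up" is hit

  // MDS thread step: NCSMDS_NEW_ACTIVE -> reset the flag, queue AVD_UP.
  void mds_new_active() {
    avd_up = false;
    events.push(1);
  }

  // Main thread step: dequeue and process one AVND_EVT_MDS_AVD_UP.
  void main_process_avd_up() {
    events.pop();
    if (avd_up) duplicate_active = true;  // real amfnd reboots the node here
    avd_up = true;
  }
};
```

Replaying the failure ordering (two MDS steps, then two main-thread steps) sets `duplicate_active`; alternating the steps, as in the normal case, does not.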
---
** [tickets:#3317] amfnd: two NEW_ACTIVE amfd in split-brain scenario**
**Status:** accepted
**Milestone:** 5.22.11
**Created:** Fri Jun 10, 2022 04:00 AM UTC by Hieu Hong Hoang
**Last Updated:** Fri Jun 10, 2022 04:00 AM UTC
**Owner:** Hieu Hong Hoang
This issue happened when we tested the system in a split-brain scenario. We
split the cluster into partitions as follows: [[SC-1(ACT), SC-2(STB), SC-3],
[SC-4(ACT), SC-5(STB), SC-6], [SC-7, SC-8, SC-9(STB), SC-10(ACT)]], then
merged all the nodes back. The quiesced SC-3 saw the active nodes in the other
partitions come up while the active SC-1 in its own partition was still alive,
so no service events were raised for the active nodes in the other partitions.
When SC-1 went down, one of the other active SCs was notified as the new
active. After that new active SC went down, the other active SC was notified.
Finally, amfnd on SC-3 detected two NEW_ACTIVE amfd and rebooted the node.
Log analysis:
* SC-3 detected the active amfd in the other partitions coming up:
~~~
<143>1 2022-05-31T05:34:56.169467+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25315"] >> mds_mcm_svc_up
<143>1 2022-05-31T05:34:56.169469+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25316"] MCM:API: LOCAL SVC INFO : svc_id = AVND(13) | PWE id = 1 |
VDEST id = 65535 |
<143>1 2022-05-31T05:34:56.16947+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25317"] MCM:API: REMOTE SVC INFO : svc_id = AVD(12) | PWE id = 1 |
VDEST id = 1 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2022-05-31T05:34:56.169472+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25318"] >> mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.169474+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25319"] << mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.169476+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25320"] >> mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.169477+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25321"] << mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.169479+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25322"] >> mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.16948+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25323"] << mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.169482+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25324"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.169484+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25325"] MDS:DB: Subscription Result not present
<143>1 2022-05-31T05:34:56.169486+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25326"] << mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.169487+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25327"] >> mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.169489+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25328"] << mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.169491+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25329"] >> mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.169493+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25330"] MDS:DB: adest_details: <rem_node[0x20a0f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.169494+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25331"] << get_subtn_adest_details
<143>1 2022-05-31T05:34:56.169496+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25332"] MDS:DB: sub_adest_details: <rem_node[0x20a0f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.169498+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25333"] << mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.169499+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25334"] << mds_mcm_svc_up
...
<143>1 2022-05-31T05:34:56.175867+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25497"] >> mds_mcm_svc_up
<143>1 2022-05-31T05:34:56.175869+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25498"] MCM:API: LOCAL SVC INFO : svc_id = AVND(13) | PWE id = 1 |
VDEST id = 65535 |
<143>1 2022-05-31T05:34:56.17587+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25499"] MCM:API: REMOTE SVC INFO : svc_id = AVD(12) | PWE id = 1 |
VDEST id = 1 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2022-05-31T05:34:56.175872+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25500"] >> mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.175874+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25501"] << mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.175875+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25502"] >> mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.175877+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25503"] << mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.175879+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25504"] >> mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.175881+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25505"] << mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.175882+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25506"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.175885+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25507"] MDS:DB: Subscription Result not present
<143>1 2022-05-31T05:34:56.175887+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25508"] << mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.175888+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25509"] >> mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.17589+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25510"] << mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.175891+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25511"] >> mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.175893+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25512"] MDS:DB: adest_details: <rem_node[0x2040f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.175895+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25513"] << get_subtn_adest_details
<143>1 2022-05-31T05:34:56.175897+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25514"] MDS:DB: sub_adest_details: <rem_node[0x2040f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.175898+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25515"] << mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.1759+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25516"] << mds_mcm_svc_up
~~~
* SC-1 went down and SC-4 was notified as new active:
~~~
2022-05-31T05:34:56.214424+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25653"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2010f]:dest_pid[466]>, rem_svc_pvt_ver=7
...
2022-05-31T05:34:56.21448+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25662"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
~~~
* SC-4 went down and SC-10 was notified as new active:
~~~
2022-05-31T05:34:56.214606+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25731"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
...
2022-05-31T05:34:56.214626+02:00 SC-3 osafamfnd 454 mds.log [meta
sequenceId="25740"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest =
<rem_node[0x20a0f]:dest_pid[441]>, rem_svc_pvt_ver=7
~~~
* SC-3 rebooted because it detected two active amfd:
~~~
2022-05-23 14:41:16.878 SC-3 osafamfnd[454]: Rebooting OpenSAF NodeId = 2030f
EE Name = , Reason: AVD already up, OwnNodeId = 2030f, SupervisionTime = 60
2022-05-23 14:41:16.890 SC-3 opensaf_reboot: Rebooting local node; timeout=60
~~~