Hi Mohan Kanakam,

I'm using "lxc" to run the cluster and the "iptables" tool to split/merge the 
network. I reproduced this ticket as follows:

* Add a 5-second delay before promoting the new active by changing the fmd 
configuration (/etc/opensaf/fmd.conf): 
~~~
export FMS_PROMOTE_ACTIVE_TIMER=500             # 5 seconds
~~~

* The cluster has 5 SCs:
~~~
SC-1 ACT : SC-2 QSC : SC-3 QSC : SC-4 QSC : SC-5 STB
   ^          ^            ^         ^         ^
   |          |            |         |         |
   ---------------------------------------------
~~~
   (ACT: active, STB: standby, QSC: quiesced)

* Isolate SC-2 and SC-3 (the iptables commands are sketched after these 
steps). We will soon have two more active SCs because each isolated SC is 
promoted to active:
~~~
SC-1 ACT : SC-2 ACT : SC-3 ACT : SC-4 QSC : SC-5 STB
   ^                                ^          ^
   |                                |          |
   ---------------------------------------------
~~~

* Unblock the connection from SC-4 to SC-2, then unblock the connection from 
SC-4 to SC-3. SC-4 won't receive any up events, as a consequence of [ticket 
3281](https://sourceforge.net/p/opensaf/tickets/3281/):
~~~
SC-1 ACT : SC-2 ACT : SC-3 ACT : SC-4 QSC : SC-5 STB
  ^           ^          ^          ^ ^ ^      ^
  |           |          |          | | |      |
  |           |          ------------ | |      |
  |           -------------------------  |     |
-----------------------------------------------
~~~
* Stop SC-1, SC-2 and SC-3 sequentially. Now SC-4 receives a NEW_ACTIVE event 
for SC-2 and then another NEW_ACTIVE event for SC-3. Note: SC-5 won't become 
active until the promote timer expires.
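
For reference, the split, merge and node-stop steps above can be done with 
plain iptables DROP rules inside the lxc containers, roughly as in the sketch 
below. The *_IP variables and container names are placeholders for my setup; 
adjust them to your own lxc network (this assumes an IP-based transport such 
as TCP, since iptables only filters IP traffic):
~~~
# Isolate SC-2: drop traffic to/from every other SC (run inside the SC-2
# container; repeat on SC-3 with its own peer list). The *_IP variables
# are placeholders for the container addresses.
for peer in "$SC1_IP" "$SC3_IP" "$SC4_IP" "$SC5_IP"; do
    iptables -A INPUT  -s "$peer" -j DROP
    iptables -A OUTPUT -d "$peer" -j DROP
done

# Unblock SC-4 <-> SC-2 by deleting the matching rules on SC-2
# (then do the same on SC-3 to unblock SC-4 <-> SC-3):
iptables -D INPUT  -s "$SC4_IP" -j DROP
iptables -D OUTPUT -d "$SC4_IP" -j DROP

# Stop SC-1, SC-2 and SC-3 sequentially, e.g. from the lxc host:
for sc in SC-1 SC-2 SC-3; do lxc-stop -n "$sc"; done
~~~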

Another important condition is that the two NEW_ACTIVE events must be 
processed at nearly the same time. However, amfnd processes a NEW_ACTIVE event 
very quickly, which makes the issue hard to reproduce, so I changed the source 
code to slow amfnd down. The details are in the attached file.
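
For completeness, the attached diff is meant to be applied on top of the 
OpenSAF sources before rebuilding, along these lines (the source directory is 
a placeholder; pick the -p strip level that matches the paths in the diff):
~~~
# Apply the attached delay patch on top of the OpenSAF source tree
# ("opensaf-src" is a placeholder), then rebuild and reinstall
# OpenSAF on the SCs as usual.
cd opensaf-src
patch -p1 < amfnd_delay_processing.diff
~~~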

Best regards,
Hieu



Attachments:

- [amfnd_delay_processing.diff](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/abf6973daa/630f/attachment/amfnd_delay_processing.diff) (342 Bytes; application/octet-stream)


---

** [tickets:#3317] amfnd: two NEW_ACTIVE amfd in split-brain scenario**

**Status:** accepted
**Milestone:** 5.22.11
**Created:** Fri Jun 10, 2022 04:00 AM UTC by Hieu Hong Hoang
**Last Updated:** Fri Jul 22, 2022 07:38 PM UTC
**Owner:** Hieu Hong Hoang


This issue happened when we tested the system in a split-brain scenario. We 
split the cluster into partitions as follows: [[SC-1(ACT), SC-2(STB), SC-3], 
[SC-4(ACT), SC-5(STB), SC-6], [SC-7, SC-8, SC-9(STB), SC-10(ACT)]] and then 
merged all nodes back. The quiesced SC-3 detected the active nodes in the 
other partitions coming up while the active SC-1 in its own partition was 
still alive, so no service events were raised for those active nodes. When 
SC-1 went down, one of the other active SCs was notified as the new active. 
After that new active SC went down, the other active SC was notified. 
Finally, the SC-3 "amfnd" detected two NEW_ACTIVE amfd and rebooted.

Log analysis:

* SC-3 detected the active amfd in the other partitions coming up:
~~~
<143>1 2022-05-31T05:34:56.169467+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25315"] >> mds_mcm_svc_up
<143>1 2022-05-31T05:34:56.169469+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25316"] MCM:API: LOCAL SVC INFO  : svc_id = AVND(13) | PWE id = 1 | 
VDEST id = 65535 |
<143>1 2022-05-31T05:34:56.16947+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25317"] MCM:API: REMOTE SVC INFO : svc_id = AVD(12) | PWE id = 1 | 
VDEST id = 1 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2022-05-31T05:34:56.169472+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25318"] >> mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.169474+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25319"] << mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.169476+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25320"] >> mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.169477+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25321"] << mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.169479+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25322"] >> mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.16948+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25323"] << mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.169482+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25324"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.169484+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25325"] MDS:DB: Subscription Result not present
<143>1 2022-05-31T05:34:56.169486+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25326"] << mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.169487+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25327"] >> mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.169489+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25328"] << mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.169491+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25329"] >> mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.169493+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25330"] MDS:DB: adest_details: <rem_node[0x20a0f]:dest_pid[441]> 
<143>1 2022-05-31T05:34:56.169494+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25331"] << get_subtn_adest_details
<143>1 2022-05-31T05:34:56.169496+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25332"] MDS:DB: sub_adest_details: <rem_node[0x20a0f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.169498+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25333"] << mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.169499+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25334"] << mds_mcm_svc_up
...

<143>1 2022-05-31T05:34:56.175867+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25497"] >> mds_mcm_svc_up
<143>1 2022-05-31T05:34:56.175869+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25498"] MCM:API: LOCAL SVC INFO  : svc_id = AVND(13) | PWE id = 1 | 
VDEST id = 65535 |
<143>1 2022-05-31T05:34:56.17587+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25499"] MCM:API: REMOTE SVC INFO : svc_id = AVD(12) | PWE id = 1 | 
VDEST id = 1 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2022-05-31T05:34:56.175872+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25500"] >> mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.175874+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25501"] << mds_svc_tbl_query
<143>1 2022-05-31T05:34:56.175875+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25502"] >> mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.175877+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25503"] << mds_subtn_tbl_get_details
<143>1 2022-05-31T05:34:56.175879+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25504"] >> mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.175881+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25505"] << mds_mcm_validate_scope
<143>1 2022-05-31T05:34:56.175882+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25506"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.175885+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25507"] MDS:DB: Subscription Result not present
<143>1 2022-05-31T05:34:56.175887+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25508"] << mds_get_subtn_res_tbl_by_adest
<143>1 2022-05-31T05:34:56.175888+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25509"] >> mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.17589+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25510"] << mds_subtn_res_tbl_get
<143>1 2022-05-31T05:34:56.175891+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25511"] >> mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.175893+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25512"] MDS:DB: adest_details: <rem_node[0x2040f]:dest_pid[441]> 
<143>1 2022-05-31T05:34:56.175895+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25513"] << get_subtn_adest_details
<143>1 2022-05-31T05:34:56.175897+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25514"] MDS:DB: sub_adest_details: <rem_node[0x2040f]:dest_pid[441]>
<143>1 2022-05-31T05:34:56.175898+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25515"] << mds_subtn_res_tbl_add
<143>1 2022-05-31T05:34:56.1759+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25516"] << mds_mcm_svc_up
~~~
* SC-1 went down and SC-4 was notified as the new active:
~~~
2022-05-31T05:34:56.214424+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25653"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535 
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest = 
<rem_node[0x2010f]:dest_pid[466]>, rem_svc_pvt_ver=7
...
2022-05-31T05:34:56.21448+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25662"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got 
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest = 
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
~~~
* SC-4 went down and SC-10 was notified as the new active:
~~~
2022-05-31T05:34:56.214606+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25731"] MCM:API: svc_down : svc_id = AVND(13) on DEST id = 65535 
got NCSMDS_DOWN for svc_id = AVD(12) on Vdest id = 1 Adest = 
<rem_node[0x2040f]:dest_pid[441]>, rem_svc_pvt_ver=7
...
2022-05-31T05:34:56.214626+02:00 SC-3 osafamfnd 454 mds.log [meta 
sequenceId="25740"] MCM:API: svc_up : svc_id = AVND(13) on DEST id = 65535 got 
NCSMDS_NEW_ACTIVE for svc_id = AVD(12) on Vdest id = 1 Adest = 
<rem_node[0x20a0f]:dest_pid[441]>, rem_svc_pvt_ver=7
~~~
* SC-3 rebooted because it detected two active amfd:
~~~
2022-05-23 14:41:16.878 SC-3 osafamfnd[454]: Rebooting OpenSAF NodeId = 2030f 
EE Name = , Reason: AVD already up, OwnNodeId = 2030f, SupervisionTime = 60
2022-05-23 14:41:16.890 SC-3 opensaf_reboot: Rebooting local node; timeout=60
~~~

