https://sourceforge.net/p/opensaf/mailman/message/36439057/


---

** [tickets:#2934] imm: cluster is rebooted after split-brain recovery**

**Status:** review
**Milestone:** 5.18.12
**Created:** Wed Oct 03, 2018 04:06 AM UTC by Vu Minh Nguyen
**Last Updated:** Fri Oct 12, 2018 06:38 AM UTC
**Owner:** Vu Minh Nguyen
**Attachments:**

- [split_brain.log](https://sourceforge.net/p/opensaf/tickets/2934/attachment/split_brain.log) (13.8 kB; application/octet-stream)


Here is the scenario:
1. Start cluster (SC Absence enabled)
2. Load 2N app on PL-3 (active su), PL-4 (standby su)
3. Split network: [SC-1, PL-3] vs [SC-2, PL-4, PL-5]
4. SU on PL-4 is assigned active
5. Merge network

During split-brain, the following are possible:
1) Two coordinators.
2) Two inconsistent RAM databases with different global counters (epoch, fevs, 
CCB id, etc.).
3) Different counters. 
4) Two PBE processes access a shared sqlite database, so this database may not 
be in sync with one of the databases in RAM. 

With these possibilities in mind, the following issues may come up after 
split-brain recovery: 
**1) Cluster is rebooted since "fail to find candidate for new IMMND 
coordinator". The attached syslog shows the whole picture:** 
a) Promote coord on PL-3 with epoch counter = 5. 
b) PL-4 and PL-5 introduce themselves and update the ruling epoch counter to 6 
on the active IMMD.
c) The coord is terminated due to an epoch counter mismatch (5 != 6).
d) Coord on PL-4 is picked.
e) IMMND on PL-4 and PL-5 are restarted due to OUT OF ORDER (the fevs counter 
is not aligned with the one on PL-3).
f) No coord candidate found, and cluster reboot is triggered.

**2) If one SC (the oldest) is kept and the younger one is restarted to recover 
from split-brain, the RAM database may be inconsistent with the sqlite database 
since the coord is still running.**
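
The sequence a) to f) under issue 1 can be replayed with a small toy model. This is only a sketch of the reasoning, not OpenSAF code: the `Immnd` class, the `usable` flag, and the fevs values 100 and 250 are invented for illustration; only the epoch values 5 and 6 and the node names come from the ticket.

```python
from dataclasses import dataclass


@dataclass
class Immnd:
    """Toy model of an IMMND; not the OpenSAF data structure."""
    name: str
    epoch: int
    fevs: int
    usable: bool = True  # False once the node is terminated or restarted


def simulate_merge():
    # Epochs 5 and 6 come from the ticket; the fevs values are invented.
    pl3 = Immnd("PL-3", epoch=5, fevs=100)
    pl4 = Immnd("PL-4", epoch=6, fevs=250)
    pl5 = Immnd("PL-5", epoch=6, fevs=250)
    nodes = [pl3, pl4, pl5]
    events = []

    # a) PL-3 is promoted to coord; the active IMMD follows its counters.
    coord = pl3
    ruling_epoch, expected_fevs = coord.epoch, coord.fevs
    events.append(f"a) coord={coord.name} epoch={ruling_epoch}")

    # b) Joining IMMNDs push their higher epoch to the active IMMD.
    for node in (pl4, pl5):
        ruling_epoch = max(ruling_epoch, node.epoch)
    events.append(f"b) ruling epoch={ruling_epoch}")

    # c) The coord no longer matches the ruling epoch and is terminated.
    if coord.epoch != ruling_epoch:
        coord.usable = False
        events.append(f"c) {coord.name} terminated ({coord.epoch} != {ruling_epoch})")

    # d) A new coord is picked among the remaining nodes.
    coord = next((n for n in nodes if n.usable), None)
    events.append(f"d) coord={coord.name if coord else None}")

    # e) Nodes whose fevs counter is out of order are restarted.
    for node in nodes:
        if node.usable and node.fevs != expected_fevs:
            node.usable = False
            events.append(f"e) {node.name} restarted (OUT OF ORDER)")

    # f) Nobody is left to become coord, so the cluster reboots.
    candidates = [n.name for n in nodes if n.usable]
    outcome = f"coord={candidates[0]}" if candidates else "cluster reboot"
    events.append(f"f) {outcome}")
    return events, outcome
```

Calling `simulate_merge()` returns the event list and the final outcome; in this model the epoch update in step b) is what removes the only node whose fevs counter the active IMMD was following.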

**Proposed solutions:**
1) For issue #1: once a coord exists, don't update the global counters on the 
active IMMD from joining IMMNDs, except for updates coming from the coord 
itself.
2) For issue #2: introduce a new admin operation towards IMM. A user who knows 
that a split-brain recovery has just happened uses that operation to inform 
IMM of the case, so that the sqlite database is regenerated from RAM, similar 
to recovery from the headless case.
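
The rule proposed for issue #1 can be sketched as a guard on the active IMMD's counter bookkeeping. This is a minimal illustration under assumptions, not the actual patch: `ActiveImmd`, `promote`, and `on_intro` are hypothetical names, and only the epoch counter is modelled.

```python
class ActiveImmd:
    """Hypothetical stand-in for the active IMMD's counter bookkeeping."""

    def __init__(self):
        self.coord = None       # name of the current IMMND coord, if any
        self.ruling_epoch = 0

    def promote(self, name, epoch):
        """Record a node as coord and adopt its epoch as the ruling epoch."""
        self.coord = name
        self.ruling_epoch = epoch

    def on_intro(self, name, epoch):
        """Handle an IMMND introduction; return True if the epoch changed."""
        # Proposed rule: once a coord exists, ignore counters reported by
        # any joining IMMND other than the coord itself.
        if self.coord is not None and name != self.coord:
            return False
        if epoch > self.ruling_epoch:
            self.ruling_epoch = epoch
            return True
        return False


immd = ActiveImmd()
immd.promote("PL-3", 5)
changed_by_joiner = immd.on_intro("PL-4", 6)  # ignored: PL-4 is not the coord
epoch_after_joiner = immd.ruling_epoch
changed_by_coord = immd.on_intro("PL-3", 6)   # accepted: comes from the coord
```

With this guard, the introduction from PL-4 leaves the ruling epoch at 5, so the coord on PL-3 would not be terminated for an epoch mismatch in step c) of the scenario.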

This ticket is to deal with issue #1, and the ticket [#2940] is for the issue 
#2.


