https://sourceforge.net/p/opensaf/mailman/message/36439057/
---
** [tickets:#2934] imm: cluster is rebooted after split-brain recovery**
**Status:** review
**Milestone:** 5.18.12
**Created:** Wed Oct 03, 2018 04:06 AM UTC by Vu Minh Nguyen
**Last Updated:** Fri Oct 12, 2018 06:38 AM UTC
**Owner:** Vu Minh Nguyen
**Attachments:**
- [split_brain.log](https://sourceforge.net/p/opensaf/tickets/2934/attachment/split_brain.log) (13.8 kB; application/octet-stream)
Here is the scenario:
1. Start cluster (SC Absence enabled)
2. Load 2N app on PL-3 (active su), PL-4 (standby su)
3. Split network: [SC-1, PL-3] vs [SC-2, PL-4, PL-5]
4. SU on PL-4 is assigned active
5. Merge network
During split-brain, the following are possible:
1) Two coordinators.
2) Two inconsistent RAM databases with different global counters (epoch, fevs,
CCB id, etc.).
3) Different counters.
4) Two PBE processes accessing a shared sqlite database, so this database may
not be in sync with one of the databases in RAM.
With these possibilities in mind, the following issues may come up after
split-brain recovery:
**1) Cluster is rebooted since "fail to find candidate for new IMMND
coordinator". The attached syslog shows the whole picture:**
a) The coord is promoted on PL-3 with epoch counter = 5.
b) PL-4 and PL-5 introduce themselves and update the ruling epoch counter to 6
on the active IMMD.
c) The coord is terminated due to an epoch counter mismatch (5 != 6).
d) The coord on PL-4 is picked.
e) IMMND on PL-4 and PL-5 are restarted due to OUT OF ORDER (the fevs counter is
not aligned with the one on PL-3).
f) No coord candidate is found, and a cluster reboot is triggered.
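
To make steps a) to c) concrete, here is a minimal toy model in Python. It is not OpenSAF code: the names `Immnd`, `ToyImmd`, `promote`, `introduce` and `check_coord` are hypothetical, and the "take the larger epoch" update rule is an assumption made only for illustration. It shows how an unconditional counter update from joining IMMNDs can invalidate a coord that is still running.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Immnd:
    """Toy stand-in for an IMMND; fields are illustrative only."""
    name: str
    epoch: int
    alive: bool = True


class ToyImmd:
    """Toy active IMMD that accepts epoch updates from any joining IMMND."""

    def __init__(self) -> None:
        self.ruling_epoch = 0
        self.coord: Optional[Immnd] = None

    def promote(self, node: Immnd) -> None:
        # Step a): the promoted coord defines the ruling epoch.
        self.coord = node
        self.ruling_epoch = node.epoch

    def introduce(self, node: Immnd) -> None:
        # Step b): any joining node may raise the ruling epoch (assumed rule).
        self.ruling_epoch = max(self.ruling_epoch, node.epoch)

    def check_coord(self) -> bool:
        # Step c): a coord whose epoch differs from the ruling epoch is terminated.
        if self.coord is not None and self.coord.epoch != self.ruling_epoch:
            self.coord.alive = False
            self.coord = None
            return False
        return True


immd = ToyImmd()
pl3 = Immnd("PL-3", epoch=5)
pl4 = Immnd("PL-4", epoch=6)
pl5 = Immnd("PL-5", epoch=6)

immd.promote(pl3)      # a) coord on PL-3, epoch 5
immd.introduce(pl4)    # b) ruling epoch becomes 6
immd.introduce(pl5)
coord_survived = immd.check_coord()  # c) 5 != 6, coord is terminated
```

In this sketch `coord_survived` ends up `False` and `pl3.alive` is `False`, mirroring the mismatch in step c). Steps d) to f) (fevs ordering and the final reboot) are not modelled.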
**2) If the oldest SC is kept and the younger one is restarted to recover from
split-brain, the RAM database may be inconsistent with the sqlite database
since the coord is still running.**
**Proposed solutions:**
1) For issue #1: once a coord exists, don't update global counters on the active
IMMD from joining IMMNDs, except those coming from the coord itself.
2) For issue #2: introduce a new admin operation towards IMM. A user who knows
that a split-brain recovery has just happened uses that operation to inform IMM
of the case, so that the sqlite database is re-generated from RAM, similar to
recovery from the headless case.
This ticket is to deal with issue #1, and the ticket [#2940] is for the issue
#2.
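
The rule proposed for issue #1 can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the actual IMMD change: `GuardedImmd`, `promote` and `introduce` are hypothetical names, and only the epoch counter is modelled. Once a coord exists, counter updates from any other IMMND are ignored.

```python
from typing import Optional


class GuardedImmd:
    """Toy active IMMD that only trusts counters coming from the coord."""

    def __init__(self) -> None:
        self.ruling_epoch = 0
        self.coord_name: Optional[str] = None

    def promote(self, name: str, epoch: int) -> None:
        # The promoted coord defines the ruling epoch.
        self.coord_name = name
        self.ruling_epoch = epoch

    def introduce(self, name: str, epoch: int) -> bool:
        """Return True if the ruling epoch was updated by this message."""
        # Proposed guard: with a coord in place, ignore counters from others.
        if self.coord_name is not None and name != self.coord_name:
            return False
        if epoch > self.ruling_epoch:
            self.ruling_epoch = epoch
            return True
        return False


immd = GuardedImmd()
immd.promote("PL-3", 5)

updated_by_pl4 = immd.introduce("PL-4", 6)   # ignored: PL-4 is not the coord
updated_by_pl5 = immd.introduce("PL-5", 6)   # ignored: PL-5 is not the coord
epoch_after_join = immd.ruling_epoch         # still the coord's epoch

updated_by_coord = immd.introduce("PL-3", 6)  # accepted: sent by the coord
```

With the guard in place, the ruling epoch stays aligned with the coord while PL-4 and PL-5 join, so the mismatch that terminated the coord in the scenario does not arise in this model.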