[ https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Murmann reassigned GEODE-9822:
----------------------------------------

    Assignee: Bill Burcham

> Split-brain Certain During Network Partition in Two-Locator Cluster
> -------------------------------------------------------------------
>
>                 Key: GEODE-9822
>                 URL: https://issues.apache.org/jira/browse/GEODE-9822
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> In a two-locator cluster with default member weights and default setting
> (true) of enable-network-partition-detection, if a long-lived network
> partition separates the two members, a split-brain will arise: there will be
> two coordinators at the same time.
> The reason for this can be found in the GMSJoinLeave.isNetworkPartition()
> method. That method's name is misleading. A name like isMajorityLost() would
> probably be more apt. It needs to return true iff the weight of "crashed"
> members (in the prospective view) is greater-than-or-equal-to half (50%) of
> the total weight (of all members in the current view).
> What the method actually does is return true iff the weight of "crashed"
> members is greater-than 51% of the total weight. As a result, if we have two
> members of equal weight, and the coordinator sees that the non-coordinator is
> "crashed", the coordinator will keep running. If a network partition is
> happening, and the non-coordinator is still running, then it will become a
> coordinator and start producing views. Now we'll have two coordinators
> producing views concurrently.
> For this discussion "crashed" members are members for which the coordinator
> has received a RemoveMemberRequest message. These are members that the
> failure detector has deemed failed.
> Keep in mind the failure detector is imperfect (it's not always right), and
> that's kind of the whole point of this ticket: we've lost contact with the
> non-coordinator member, but that doesn't mean it can't still be running (on
> the other side of a partition).
> This bug is not limited to the two-locator scenario. Any set of members that
> can be partitioned into two equal sets is susceptible. In fact it's even a
> little worse than that. Any set of members that can be partitioned (into more
> than one set), where two or more of the sets each still have 49% or more of
> the total weight, will result in a split-brain.
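
To make the threshold difference in the description concrete, here is a minimal sketch of the two comparisons. It is not the actual GMSJoinLeave code: the class name QuorumCheckSketch, both method names, the integer arithmetic, and the weight of 3 per member are assumptions chosen for illustration only. It simply encodes "greater than 51%" versus "greater than or equal to 50%" as the ticket states them.

```java
/**
 * Hypothetical sketch of the two threshold checks described in GEODE-9822.
 * This is NOT the real GMSJoinLeave implementation; names and weights are
 * illustrative assumptions.
 */
final class QuorumCheckSketch {

    /** Behaviour the ticket reports: true iff crashed weight > 51% of total. */
    static boolean isNetworkPartitionAsDescribed(int crashedWeight, int totalWeight) {
        // crashed / total > 51 / 100, rewritten to avoid floating point
        return 100L * crashedWeight > 51L * totalWeight;
    }

    /** Behaviour the ticket asks for: true iff crashed weight >= 50% of total. */
    static boolean isMajorityLost(int crashedWeight, int totalWeight) {
        // crashed / total >= 1 / 2, rewritten to avoid floating point
        return 2L * crashedWeight >= totalWeight;
    }

    public static void main(String[] args) {
        // Two members of equal (assumed) weight 3; each side of the partition
        // sees the other member as "crashed".
        int total = 6;
        int crashed = 3;
        System.out.println("described: " + isNetworkPartitionAsDescribed(crashed, total));
        System.out.println("proposed: " + isMajorityLost(crashed, total));
    }
}
```

With two equal-weight members, the reported check returns false on both sides of the partition, so both sides keep running; the proposed check returns true on both sides. The same arithmetic covers the 49%/51% case from the description: with a total weight of 100, the reported check returns false whether the crashed weight is 49 or 51, while the proposed check returns true only for the side that has lost 51.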