[ 
https://issues.apache.org/jira/browse/GEODE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494855#comment-17494855
 ] 

Barrett Oglesby commented on GEODE-9910:
----------------------------------------

Here is some analysis of this issue.
h3. Server Addresses

node 1:

membership: 10.196.55.141(15661)<ec><v0>:42000
locator: 10.196.55.141:10335

node 2:

membership: 10.196.55.142(19002)<ec><v1>:42000
locator: 10.196.55.142:10335
h3. Node2 Initial Disconnect

node2 lost connectivity with node1 and removed it:
{noformat}
2021-11-28 04:03:45,084 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure 
Detection thread 9] [] Availability check failed for member 
10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:03:45,084 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure 
Detection thread 9] [] Requesting removal of suspect member 
10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:03:45,085 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure 
Detection thread 9] [] This member is becoming the membership coordinator with 
address 10.196.55.142(19002)<ec><v1>:42000
{noformat}
It then realized that quorum had been lost (node1 was coordinator with 
weight=15; node2 was not coordinator with weight=10):
{noformat}
2021-11-28 04:03:45,091 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[Geode 
Membership View Creator] [] View Creator thread is starting
2021-11-28 04:03:45,091 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[Geode 
Membership View Creator] []   10.196.55.141(15661)<ec><v0>:42000 had a weight 
of 15
2021-11-28 04:03:45,092 WARN  
[org.apache.geode.distributed.internal.membership.gms.Services]-[Geode 
Membership View Creator] [] total weight lost in this view change is 15 of 25.  
Quorum has been lost!
2021-11-28 04:03:45,092 FATAL 
[org.apache.geode.distributed.internal.membership.gms.Services]-[Geode 
Membership View Creator] [] Possible loss of quorum due to the loss of 1 cache 
processes: [10.196.55.141(15661)<ec><v0>:42000]
{noformat}
And disconnected itself from the distributed system:
{noformat}
2021-11-28 04:03:46,093 FATAL 
[org.apache.geode.distributed.internal.membership.gms.Services]-[Geode 
Membership View Creator] [] Membership service failure: Exiting due to possible 
network partition event due to loss of 1 cache processes: 
[10.196.55.141(15661)<ec><v0>:42000]
org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException:
 Exiting due to possible network partition event due to loss of 1 cache 
processes: [10.196.55.141(15661)<ec><v0>:42000]
        at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:1787)
 [geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1122)
 [geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.access$1300(GMSJoinLeave.java:80)
 [geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.prepareAndSendView(GMSJoinLeave.java:2588)
 [geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.sendInitialView(GMSJoinLeave.java:2204)
 [geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.run(GMSJoinLeave.java:2286)
 [geode-membership-1.14.0.jar:?]
{noformat}
It stopped its locator:
{noformat}
2021-11-28 04:03:46,794 INFO 
[org.apache.geode.distributed.internal.InternalLocator]-[ReconnectThread] [] 
Distribution Locator on 
vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142 is 
stopped
{noformat}
h3. Node2 Reconnect Attempt 1
{noformat}
2021-11-28 04:04:46,800 INFO  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] Attempting to reconnect to the distributed system.  This is attempt #1.
{noformat}
The first retry attempt failed to get quorum (it needed a weight of 13 but 
only reached 10, its own weight):
{noformat}
2021-11-28 04:04:46,810 INFO  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] performing a quorum check to see if location services can be started early
2021-11-28 04:04:46,810 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] beginning quorum check with GMSQuorumChecker on view 
View[10.196.55.141(15661)<ec><v0>:42000|1] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}, 10.196.55.142(19002)<ec><v1>:42000]
2021-11-28 04:04:46,810 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] quorum check: sending request to 10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:04:46,810 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] quorum check: sending request to 10.196.55.142(19002)<ec><v1>:42000
2021-11-28 04:04:46,812 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] received ping-pong response from /10.196.55.142<v1>:42000
2021-11-28 04:04:46,812 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] quorum check: mapped address to member ID 10.196.55.142(19002)<ec><v1>:42000
2021-11-28 04:04:46,812 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] quorum check: waiting up to 15000ms to receive a quorum of responses
...
2021-11-28 04:05:01,821 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] quorum check: timeout waiting for responses.  1 responses received
2021-11-28 04:05:01,822 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] quorum check: contacted 1 processes with 10 member weight units.  Threshold 
for a quorum is 13
{noformat}
Since quorum failed, the locator did not restart:
{noformat}
2021-11-28 04:05:01,822 INFO  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] quorum check failed - not allowing location services to start early
{noformat}
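The weight arithmetic in these quorum checks can be sketched as follows. This is a minimal illustration, not Geode's code: the class and method names are hypothetical, and the 51% rounding rule is an assumption chosen because it reproduces the logged threshold of 13 for a total weight of 25. Only the weights (15 for node1, 10 for node2) come from the logs above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QuorumSketch {
    // Assumed rule (not confirmed against the Geode source): a quorum needs
    // at least 51% of the total member weight, rounded to the nearest integer.
    static int quorumThreshold(int totalWeight) {
        return (int) Math.round(totalWeight * 51.0 / 100.0);
    }

    // True when the weight of the members that responded reaches the threshold.
    static boolean hasQuorum(int respondedWeight, int totalWeight) {
        return respondedWeight >= quorumThreshold(totalWeight);
    }

    public static void main(String[] args) {
        // Weights taken from the log messages in this comment.
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("node1", 15);
        weights.put("node2", 10);

        int total = weights.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println("total=" + total + " threshold=" + quorumThreshold(total));

        // Reconnect attempt 1: only node2 answered the ping.
        System.out.println("node2 alone has quorum: " + hasQuorum(weights.get("node2"), total));

        // Reconnect attempt 2: both node1 and node2 answered.
        System.out.println("both nodes have quorum: " + hasQuorum(total, total));
    }
}
```

Under that assumption, node2's weight of 10 alone falls short of the threshold of 13, which is consistent with the failed quorum check in reconnect attempt 1.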
It was then able to send a *FindCoordinatorRequest* to 10.196.55.141:10335 
(node1's locator) and receive a *FindCoordinatorResponse* back. It then tried 
to join the distributed system through that coordinator. It made 4 such 
attempts, and none of them succeeded:
{noformat}
2021-11-28 04:05:02,078 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: 
java.net.ConnectException: Connection refused (Connection refused)
2021-11-28 04:05:06,946 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] received 
FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, 
fromView=true, viewId=1, registrants=[10.196.55.142(19002)<ec>:42000], 
senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection 
enabled=true, locators preferred as coordinators=true, 
view=View[10.196.55.141(15661)<ec><v0>:42000|1] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]) from locator 
HostAndPort[/10.196.55.141:10335]
2021-11-28 04:05:06,948 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Locator's address indicates it is part of a distributed system so I will 
not become membership coordinator on this attempt to join
2021-11-28 04:05:10,015 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these 
possible coordinators: [10.196.55.141(15661)<ec><v0>:42000]
2021-11-28 04:05:10,015 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Discovery state after looking for membership coordinator is 
locatorsContacted=1; findInViewResponses=0; alreadyTried=[]; registrants=[]; 
possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=1; 
hasContactedAJoinedLocator=true; 
view=View[10.196.55.141(15661)<ec><v0>:42000|1] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]; responses=[]
2021-11-28 04:05:10,016 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000
*2021-11-28 04:05:10,016 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Attempting to join the distributed system through coordinator 
10.196.55.141(15661)<ec><v0>:42000 using address 10.196.55.142(19002)<ec>:42000

2021-11-28 04:05:22,017 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: 
java.net.ConnectException: Connection refused (Connection refused)
2021-11-28 04:05:23,050 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] received 
FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, 
fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000], 
senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection 
enabled=true, locators preferred as coordinators=true, 
view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]  crashed: 
[10.196.55.142(19002)<ec><v1>:42000]) from locator 
HostAndPort[/10.196.55.141:10335]
2021-11-28 04:05:23,052 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Locator's address indicates it is part of a distributed system so I will 
not become membership coordinator on this attempt to join
2021-11-28 04:05:26,111 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these 
possible coordinators: [10.196.55.141(15661)<ec><v0>:42000]
2021-11-28 04:05:26,111 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Discovery state after looking for membership coordinator is 
locatorsContacted=1; findInViewResponses=0; 
alreadyTried=[10.196.55.141(15661)<ec><v0>:42000]; registrants=[]; 
possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; 
hasContactedAJoinedLocator=true; 
view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]  crashed: 
[10.196.55.142(19002)<ec><v1>:42000]; responses=[]
2021-11-28 04:05:26,113 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:05:26,113 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting 
for a join-response

2021-11-28 04:05:38,113 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: 
java.net.ConnectException: Connection refused (Connection refused)
2021-11-28 04:05:39,145 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] received 
FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, 
fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000], 
senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection 
enabled=true, locators preferred as coordinators=true, 
view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]  crashed: 
[10.196.55.142(19002)<ec><v1>:42000]) from locator 
HostAndPort[/10.196.55.141:10335]
2021-11-28 04:05:39,147 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Locator's address indicates it is part of a distributed system so I will 
not become membership coordinator on this attempt to join
2021-11-28 04:05:42,207 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these 
possible coordinators: [10.196.55.141(15661)<ec><v0>:42000]
2021-11-28 04:05:42,207 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Discovery state after looking for membership coordinator is 
locatorsContacted=1; findInViewResponses=0; 
alreadyTried=[10.196.55.141(15661)<ec><v0>:42000]; 
registrants=[10.196.55.142(19002)<ec>:42000]; 
possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; 
hasContactedAJoinedLocator=true; 
view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]  crashed: 
[10.196.55.142(19002)<ec><v1>:42000]; responses=[]
2021-11-28 04:05:42,209 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:05:42,209 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting 
for a join-response

2021-11-28 04:05:54,210 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: 
java.net.ConnectException: Connection refused (Connection refused)
2021-11-28 04:05:55,242 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] received 
FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, 
fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000], 
senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection 
enabled=true, locators preferred as coordinators=true, 
view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]  crashed: 
[10.196.55.142(19002)<ec><v1>:42000]) from locator 
HostAndPort[/10.196.55.141:10335]
2021-11-28 04:05:55,244 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Locator's address indicates it is part of a distributed system so I will 
not become membership coordinator on this attempt to join
2021-11-28 04:05:58,303 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these 
possible coordinators: [10.196.55.141(15661)<ec><v0>:42000]
2021-11-28 04:05:58,303 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Discovery state after looking for membership coordinator is 
locatorsContacted=1; findInViewResponses=0; 
alreadyTried=[10.196.55.141(15661)<ec><v0>:42000]; 
registrants=[10.196.55.142(19002)<ec>:42000]; 
possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; 
hasContactedAJoinedLocator=true; 
view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]  crashed: 
[10.196.55.142(19002)<ec><v1>:42000]; responses=[]
2021-11-28 04:05:58,305 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:05:58,305 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting 
for a join-response
{noformat}
* This message comes from {*}GMSJoinLeave.attemptToJoin{*}. The attempt fails, 
as shown by the ReconnectThread continuing on to retry.

node1-vm.log contains these messages from 
*GMSLocator.processFindCoordinatorRequest* corresponding to the 
*FindCoordinatorRequests* received from node2:
{noformat}
2021-11-28 04:05:06,898 INFO  
[org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator
 request thread 1] [] Peer locator: coordinator from view is 
10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:05:23,045 INFO  
[org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator
 request thread 1] [] Peer locator: coordinator from view is 
10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:05:39,141 INFO  
[org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator
 request thread 1] [] Peer locator: coordinator from registrations is 
10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:05:55,238 INFO  
[org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator
 request thread 1] [] Peer locator: coordinator from registrations is 
10.196.55.141(15661)<ec><v0>:42000
{noformat}
Since node2 was unable to join the distributed system, it gave up:
{noformat}
2021-11-28 04:06:10,309 WARN  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] Caught SystemConnectException in reconnect
org.apache.geode.SystemConnectException: Problem starting up membership services
        at 
org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:186)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2326)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1187)
 ~[geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1811)
 ~[geode-membership-1.14.0.jar:?]
Caused by: 
org.apache.geode.distributed.internal.membership.api.MemberStartupException: 
Unable to join the distributed system in 68228ms
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.join(GMSJoinLeave.java:411)
 ~[geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership.join(GMSMembership.java:533)
 ~[geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership.access$1200(GMSMembership.java:72)
 ~[geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.joinDistributedSystem(GMSMembership.java:1752)
 ~[geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:242)
 ~[geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1642)
 ~[geode-membership-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
 ~[geode-core-1.14.0.jar:?]
{noformat}
h3. Node2 Reconnect Attempt 2
{noformat}
2021-11-28 04:07:10,310 INFO 
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] Attempting to reconnect to the distributed system. This is attempt #2.
{noformat}
The second retry attempt succeeded in getting quorum (this means node2 was able 
to send a ping message to node1 and receive a pong message back):
{noformat}
2021-11-28 04:07:10,319 INFO  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] performing a quorum check to see if location services can be started early
2021-11-28 04:07:10,319 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] beginning quorum check with GMSQuorumChecker on view 
View[10.196.55.141(15661)<ec><v0>:42000|1] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}, 10.196.55.142(19002)<ec><v1>:42000]
2021-11-28 04:07:10,319 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] quorum check: sending request to 10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:07:10,320 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] quorum check: waiting up to 15000ms to receive a quorum of responses
2021-11-28 04:07:10,320 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[unicast
 receiver,vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2-26663] [] 
received ping-pong response from /10.196.55.141<v0>:42000
2021-11-28 04:07:10,321 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[unicast
 receiver,vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2-26663] [] 
quorum check: mapped address to member ID 10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:07:10,820 INFO  
[org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread]
 [] quorum check: received responses from all members that were in the old 
distributed system
{noformat}
It then started the locator:
{noformat}
2021-11-28 04:07:10,820 INFO  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] Quorum check passed - allowing location services to start early
2021-11-28 04:07:10,879 INFO  
[org.apache.geode.distributed.internal.InternalLocator]-[ReconnectThread] [] 
Starting peer location for Distribution Locator on 
vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142
2021-11-28 04:07:10,879 INFO  
[org.apache.geode.distributed.internal.tcpserver.TcpServer]-[ReconnectThread] 
[] Locator was created at Sun Nov 28 04:07:10 UTC 2021
2021-11-28 04:07:10,879 INFO  
[org.apache.geode.distributed.internal.tcpserver.TcpServer]-[ReconnectThread] 
[] Listening on port 10335 bound on address 
vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142
{noformat}
And recovered state from node1's locator (this means node2 was able to send a 
*GetViewRequest* to node1 and receive a *GetViewResponse* back):
{noformat}
2021-11-28 04:07:10,880 INFO  
[org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread]
 [] GemFire peer location service starting.  Other locators: 
10.196.55.142[10335],10.196.55.141[10335],10.196.55.133[10335]  Locators 
preferred as coordinators: true  Network partition detection enabled: true  
View persistence file: /locator10335view.dat
2021-11-28 04:07:10,898 INFO  
[org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread]
 [] Peer locator attempting to recover from HostAndPort[/10.196.55.141:10335]
2021-11-28 04:07:10,965 INFO  
[org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread]
 [] Peer locator recovered initial membership of 
View[10.196.55.141(15661)<ec><v0>:42000|2] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]  crashed: 
[10.196.55.142(19002)<ec><v1>:42000]
2021-11-28 04:07:10,966 INFO  
[org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread]
 [] Peer locator recovered state from HostAndPort[/10.196.55.141:10335]
{noformat}
Joining the distributed system then failed in the same way as in reconnect 
attempt 1:
{noformat}
2021-11-28 04:07:11,153 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] received 
FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, 
fromView=true, viewId=-100, registrants=[10.196.55.142(19002)<ec>:42000], 
senderId=10.196.55.142(19002)<ec>:42000, network partition detection 
enabled=true, locators preferred as coordinators=true, 
view=View[10.196.55.141(15661)<ec><v0>:42000|-100] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]  crashed: 
[10.196.55.142(19002)<ec><v1>:42000]) from locator 
HostAndPort[/10.196.55.142:10335]
2021-11-28 04:07:11,224 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] received 
FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, 
fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000, 
10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, 
network partition detection enabled=true, locators preferred as 
coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]  crashed: 
[10.196.55.142(19002)<ec><v1>:42000]) from locator 
HostAndPort[/10.196.55.141:10335]
2021-11-28 04:07:11,226 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Locator's address indicates it is part of a distributed system so I will 
not become membership coordinator on this attempt to join
2021-11-28 04:07:14,303 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these 
possible coordinators: [10.196.55.141(15661)<ec><v0>:42000]
2021-11-28 04:07:14,304 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Discovery state after looking for membership coordinator is 
locatorsContacted=2; findInViewResponses=0; alreadyTried=[]; registrants=[]; 
possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; 
hasContactedAJoinedLocator=true; 
view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: 
[10.196.55.141(15661)<ec><v0>:42000{lead}]  crashed: 
[10.196.55.142(19002)<ec><v1>:42000]; responses=[]
2021-11-28 04:07:14,305 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000
2021-11-28 04:07:14,305 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Attempting to join the distributed system through coordinator 
10.196.55.141(15661)<ec><v0>:42000 using address 10.196.55.142(19002)<ec>:42000

2021-11-28 04:07:29,409 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting 
for a join-response

2021-11-28 04:07:44,516 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting 
for a join-response

2021-11-28 04:07:59,617 INFO  
[org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread]
 [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting 
for a join-response
{noformat}
Since it was unable to join the distributed system, it gave up:
{noformat}
2021-11-28 04:08:11,622 WARN  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] Caught SystemConnectException in reconnect
{noformat}
*However, the locator was not stopped.*
h3. Node2 Reconnect Attempt 3
{noformat}
2021-11-28 04:09:11,623 INFO  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] Attempting to reconnect to the distributed system.  This is attempt #3.
{noformat}
The third retry attempt also succeeded in getting quorum:
{noformat}
2021-11-28 04:09:11,631 INFO  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] performing a quorum check to see if location services can be started early
2021-11-28 04:09:11,631 INFO  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] Quorum check passed - allowing location services to start early
{noformat}
Starting the locator then threw an exception, because the locator started in 
reconnect attempt 2 was still running:
{noformat}
2021-11-28 04:09:11,632 WARN  
[org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
 [] Exception occurred while trying to connect the system during reconnect
java.lang.IllegalStateException: A locator can not be created because one 
already exists in this JVM.
        at 
org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:298)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:273)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.startInitLocator(InternalDistributedSystem.java:916)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:768)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
 ~[geode-core-1.14.0.jar:?]
        at 
org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
 ~[geode-core-1.14.0.jar:?]
{noformat}
I'm not sure why the servers were able to exchange some messages (e.g. 
{*}FindCoordinatorRequest/Response{*}, {*}GetViewRequest/Response{*}) but not 
others (e.g. {*}JoinRequestMessage/JoinResponseMessage{*}), but that could be 
just the state of the servers at the time.

In any event, if a reconnect attempt fails, any started locator should be 
stopped.
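
A minimal sketch of that cleanup, under stated assumptions: {{Locator}}, {{JoinStep}}, and {{reconnectOnce}} below are hypothetical stand-ins written for illustration, not Geode's {{InternalLocator}} or {{InternalDistributedSystem}} APIs. The sketch only mirrors the behavior seen in the logs, where a second locator cannot be created in the same JVM, and shows a failed attempt stopping the locator it started so that the next attempt can create one again.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ReconnectCleanupSketch {

    /** Hypothetical per-JVM locator; a stand-in, not Geode's InternalLocator. */
    static final class Locator {
        private static final AtomicReference<Locator> CURRENT = new AtomicReference<>();

        // Mirrors the logged behavior: creation fails if a locator already exists.
        static Locator create() {
            Locator locator = new Locator();
            if (!CURRENT.compareAndSet(null, locator)) {
                throw new IllegalStateException(
                    "A locator can not be created because one already exists in this JVM.");
            }
            return locator;
        }

        // Clears the per-JVM slot, but only if this instance still owns it.
        void stop() {
            CURRENT.compareAndSet(this, null);
        }

        static boolean exists() {
            return CURRENT.get() != null;
        }
    }

    /** Hypothetical join step that may fail, e.g. with a join timeout. */
    interface JoinStep {
        void join() throws Exception;
    }

    // One reconnect attempt: start the locator, try to join, and stop the
    // locator again if the join did not succeed.
    static boolean reconnectOnce(JoinStep joinStep) {
        Locator locator = Locator.create();
        boolean joined = false;
        try {
            joinStep.join();
            joined = true;
        } catch (Exception e) {
            System.out.println("join failed: " + e.getMessage());
        } finally {
            if (!joined) {
                locator.stop();
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        JoinStep alwaysFails = () -> {
            throw new Exception("Unable to join the distributed system");
        };
        for (int attempt = 1; attempt <= 3; attempt++) {
            boolean joined = reconnectOnce(alwaysFails);
            System.out.println("attempt #" + attempt + " joined=" + joined
                + " locatorExists=" + Locator.exists());
        }
    }
}
```

Without the {{finally}} block, the second call to {{reconnectOnce}} after a failed join would throw the same {{IllegalStateException}} seen in reconnect attempt 3.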

> Failure to auto-reconnect upon network partition
> ------------------------------------------------
>
>                 Key: GEODE-9910
>                 URL: https://issues.apache.org/jira/browse/GEODE-9910
>             Project: Geode
>          Issue Type: Bug
>    Affects Versions: 1.14.0
>            Reporter: Surya Mudundi
>            Assignee: Barrett Oglesby
>            Priority: Major
>              Labels: GeodeOperationAPI, blocks-1.15.0​, needsTriage
>         Attachments: geode-logs.zip
>
>
> A two-node cluster with embedded locators failed to auto-reconnect after 
> node-1 experienced a network outage lasting a couple of minutes. When node-1 
> recovered from the outage, node-2 failed to auto-reconnect.
> node-2 tried to reconnect to node-1 as follows:
> [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
>  [] Attempting to reconnect to the distributed system.  This is attempt #1.
> [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
>  [] Attempting to reconnect to the distributed system.  This is attempt #2.
> [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
>  [] Attempting to reconnect to the distributed system.  This is attempt #3.
> After 3 attempts, it finally reported the error below:
> INFO  
> [org.apache.geode.logging.internal.LoggingProviderLoader]-[ReconnectThread] 
> [] Using org.apache.geode.logging.internal.SimpleLoggingProvider for service 
> org.apache.geode.logging.internal.spi.LoggingProvider
> INFO  [org.apache.geode.internal.InternalDataSerializer]-[ReconnectThread] [] 
> initializing InternalDataSerializer with 0 services
> INFO  
> [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
>  [] performing a quorum check to see if location services can be started early
> INFO  
> [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
>  [] Quorum check passed - allowing location services to start early
> WARN  
> [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread]
>  [] Exception occurred while trying to connect the system during reconnect
> java.lang.IllegalStateException: A locator can not be created because one 
> already exists in this JVM.
>         at 
> org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:298)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:273)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.startInitLocator(InternalDistributedSystem.java:916)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:768)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2326)
>  ~[geode-core-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1187)
>  ~[geode-membership-1.14.0.jar:?]
>         at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1811)
>  ~[geode-membership-1.14.0.jar:?]
>         at java.lang.Thread.run(Thread.java:829) [?:?]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
