Bruce J Schuchardt created GEODE-8467:
-----------------------------------------

             Summary: server fails to notify of a ForcedDisconnect and fails to tear down the cache
                 Key: GEODE-8467
                 URL: https://issues.apache.org/jira/browse/GEODE-8467
             Project: Geode
          Issue Type: Bug
          Components: membership
    Affects Versions: 1.12.0, 1.11.0, 1.10.0, 1.13.0, 1.14.0
            Reporter: Bruce J Schuchardt


A test with auto-reconnect enabled hung while restarting a server.  The 
restarting server was still building its cache when it was kicked out of the 
cluster because of very high load on the test machine.  Membership initiated a 
forced disconnect:
{noformat}
[fatal 2020/08/22 00:51:04.508 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] Membership service failure: Member isn't responding to heartbeat requests
org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Member isn't responding to heartbeat requests
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2012)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1085)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:688)
        at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1331)
        at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1267)
 {noformat}
 

It then logged that it was generating a description of the cache:
{noformat}
[info 2020/08/22 00:51:05.933 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] generating XML to rebuild the cache after reconnect completes
{noformat}
 

However, it never logged completion of this step and never forked a thread to 
tear down the cache.  Any exception thrown by XML generation would have been 
caught by JGroups code, which logs the problem at WARNING level.  We have 
JGroups logging set to FATAL level, so the exception would not have appeared 
in the logs.

We need to add exception handling around XML generation and, if an exception 
is caught, disable reconnect attempts and have the server shut down.
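
A minimal sketch of that handling follows.  It is not the actual Geode code: 
the class ForcedDisconnectSketch, its methods, and the Supplier standing in 
for XML generation are all hypothetical names chosen for illustration.  The 
idea is to catch any failure from XML generation on the receiver thread, 
report it ourselves instead of relying on JGroups logging, disable reconnect, 
and fork the teardown thread in every case.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names are Geode classes or methods.
final class ForcedDisconnectSketch {
  private final AtomicBoolean reconnectEnabled = new AtomicBoolean(true);
  private volatile String cacheXml; // stays null if XML generation fails
  private volatile Thread tearDownThread;
  private volatile boolean cacheClosed;

  /** Runs on the thread that detected the forced disconnect. */
  void handleForcedDisconnect(Supplier<String> xmlGenerator) {
    try {
      cacheXml = xmlGenerator.get();
    } catch (Throwable failure) {
      // Report the failure here rather than letting it escape to the caller,
      // whose logging may be filtered out. A real fix would probably treat
      // VirtualMachineError separately; this sketch does not.
      System.err.println("fatal: XML generation failed, disabling reconnect: " + failure);
      reconnectEnabled.set(false);
    }
    // Fork the teardown whether or not XML generation succeeded.
    Thread thread = new Thread(this::tearDownCache, "cache-teardown-sketch");
    thread.setDaemon(true);
    tearDownThread = thread;
    thread.start();
  }

  // Stand-in for closing the cache and disconnecting from the cluster.
  private void tearDownCache() {
    cacheClosed = true;
  }

  /** Waits up to the given time for teardown; returns true if it completed. */
  boolean awaitTearDown(long millis) {
    Thread thread = tearDownThread;
    if (thread == null) {
      return false;
    }
    try {
      thread.join(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return cacheClosed;
  }

  boolean isReconnectEnabled() {
    return reconnectEnabled.get();
  }

  String getCacheXml() {
    return cacheXml;
  }

  public static void main(String[] args) {
    ForcedDisconnectSketch server = new ForcedDisconnectSketch();
    server.handleForcedDisconnect(() -> {
      throw new IllegalStateException("simulated XML generation failure");
    });
    System.out.println("torn down: " + server.awaitTearDown(5000)
        + ", reconnect enabled: " + server.isReconnectEnabled());
  }
}
```

With this shape, a failure in XML generation can no longer leave the server 
half-disconnected: teardown always starts, and reconnect is attempted only 
when a cache description is actually available.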

The bug isn't easy to hit.  I've run the failing test more than 5000 times 
without reproducing it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
