[ 
https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated GEODE-9402:
----------------------------------
    Labels: pull-request-available  (was: )

> Automatic Reconnect Failure: Address already in use
> ---------------------------------------------------
>
>                 Key: GEODE-9402
>                 URL: https://issues.apache.org/jira/browse/GEODE-9402
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Juan Ramos
>            Assignee: Jianxia Chen
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip
>
>
> There are 2 locators and 4 servers during the test, once they're all up and 
> running the test drops the network connectivity between all members to 
> generate a full network partition and cause all members to shutdown and go 
> into reconnect mode. Upon reaching the mentioned state, the test 
> automatically restores the network connectivity and expects all members to 
> automatically go up again and re-form the distributed system.
>  This works fine most of the time, and we see every member successfully 
> reconnecting to the distributed system:
> {noformat}
> [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 <ReconnectThread> 
> tid=0x87] Reconnect completed.
> [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 <ReconnectThread> 
> tid=0x86] Reconnect completed.
> [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 <ReconnectThread> 
> tid=0x94] Reconnect completed.
> [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 <ReconnectThread> 
> tid=0x96] Reconnect completed.
> [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 <ReconnectThread> 
> tid=0x97] Reconnect completed.
> [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 <ReconnectThread> 
> tid=0x95] Reconnect completed.
> {noformat}
> In some rare occasions, though, one of the servers fails during the reconnect 
> phase with the following exception:
> {noformat}
> [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 <ReconnectThread> 
> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = 
> false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server 
> = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed 
> because:
> org.apache.geode.GemFireIOException: While starting cache server CacheServer 
> on port=40404 client subscription config policy=none client subscription 
> config capacity=1 client subscription config overflow directory=.
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
>       at 
> org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
>       at 
> org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.net.BindException: Address already in use (Bind failed)
>       at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
>       at 
> java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
>       at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
>       at 
> org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
>       at 
> org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
>       at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
>       at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
>       at 
> org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
>       at 
> org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
>       ... 14 more
> {noformat}
> It seems that the server is trying to bind the port before the old instance 
> has finished shutting down and cleaning up resources, causing the reconnect 
> process to halt and stop re-trying, and leaving the cluster with one less 
> member.
> We've been able to reproduce the problem only twice in the past few weeks, 
> I've attached the two set of artefacts to the ticket:
>  - _*cluster_logs_pks_121*_: the member that throws the {{BindException}} 
> during reconnect is {{gemfire-cluster-server-1}}.
>  - _*cluster_logs_gke_latest_54*_: the member that throws the 
> {{BindException}} during reconnect is {{gemfire-cluster-server-0}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to