Lynn Gallinat created GEODE-9425:
------------------------------------
Summary: AutoConnectionSource thread in client can't query for
available locators when it is connected to a locator that was shut down
Key: GEODE-9425
URL: https://issues.apache.org/jira/browse/GEODE-9425
Project: Geode
Issue Type: Bug
Components: client/server
Affects Versions: 1.15.0
Reporter: Lynn Gallinat
The AutoConnectionSource thread runs in a client and periodically queries the locator the client is connected to in order to refresh the list of available locators. But if that locator has been shut down, the query cannot complete, so the client cannot get an updated locator list. In this case the locator was shut down and is not coming back, yet another locator is available; the client just cannot discover it because the query never completes.
To summarize: the AutoConnectionSource thread that runs in a client to update the list of available locators should be able to obtain that list even when the client is connected to a locator that was shut down.
The AutoConnectionSource UpdateLocatorListTask starts and then runs every 10 seconds, as shown in the client's system log:
{noformat}
[info 2021/07/07 19:37:33.723 GMT clientgemfire1_host1_881 <vm_0_thr_0_client1_host1_881> tid=0x2d] AutoConnectionSource UpdateLocatorListTask started with interval=10000 ms.
{noformat}
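The pattern can be sketched outside Geode in a few lines of plain Java. This is an illustration only, not Geode code: {{refreshFromLocator}} is a hypothetical stand-in for the LocatorListRequest exchange the real task performs, and the point it demonstrates is that bounding the connect call is what keeps such a periodic task from wedging on a dead locator.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Collections;
import java.util.List;

// Minimal sketch (NOT the Geode implementation) of a periodic
// "ask one known locator for the current locator list" task.
public class LocatorRefreshSketch {

    // Hypothetical stand-in for the real locator-list request.
    // Returns an empty list instead of hanging when the locator is gone.
    static List<String> refreshFromLocator(String host, int port, int connectTimeoutMillis) {
        try (Socket s = new Socket()) {
            // Bounding the connect keeps a scheduled task from blocking
            // for the full socket-connect timeout on a dead locator.
            s.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
            // A real client would now exchange a locator-list request here.
            return Collections.singletonList(host + ":" + port);
        } catch (IOException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) throws IOException {
        // Grab a free port and close it so nothing is listening there.
        int deadPort;
        try (ServerSocket ss = new ServerSocket(0)) {
            deadPort = ss.getLocalPort();
        }
        List<String> locators = refreshFromLocator("127.0.0.1", deadPort, 1_000);
        System.out.println("locators found: " + locators.size());
    }
}
```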
After the locator is shut down, the AutoConnectionSource thread cannot complete its work, so the client reports stuck threads.
This stuck-thread stack shows it is trying to run UpdateLocatorListTask:
{noformat}
clientgemfire1_881/system.log: [warn 2021/07/07 19:47:25.784 GMT clientgemfire1_host1_881 <ThreadsMonitor> tid=0x36] Thread <286> (0x11e) that was executed at <07 Jul 2021 19:46:03 GMT> has been stuck for <82.041 seconds> and number of thread monitor iteration <1>
Thread Name <poolTimer-pool-24> state <RUNNABLE>
Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack for "poolTimer-pool-24" (0x11e):
java.lang.ThreadState: RUNNABLE (in native)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
- locked java.net.SocksSocketImpl@3e95a505
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
at org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
at org.apache.geode.distributed.internal.tcpserver.ClusterSocketCreatorImpl.connect(ClusterSocketCreatorImpl.java:96)
at org.apache.geode.distributed.internal.tcpserver.TcpClient.getServerVersion(TcpClient.java:246)
at org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:151)
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocatorUsingConnection(AutoConnectionSourceImpl.java:217)
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocator(AutoConnectionSourceImpl.java:207)
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryLocators(AutoConnectionSourceImpl.java:254)
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.access$200(AutoConnectionSourceImpl.java:68)
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl$UpdateLocatorListTask.run2(AutoConnectionSourceImpl.java:458)
at org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1334)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:285)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- java.util.concurrent.ThreadPoolExecutor$Worker@24cd39b5
{noformat}
Impact on running cache operations:
Any operation in progress on a client connected to a locator that was shut down can take 59 seconds to complete, which is the default socket connect timeout. From org.apache.geode.cache.client.PoolFactory:
{noformat}
int DEFAULT_SOCKET_CONNECT_TIMEOUT = 59000;
{noformat}
If the client's connection pool lists multiple locators, the client can hit a 59-second timeout for each one that was shut down, so an operation could take minutes to complete.
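Assuming the default stays at 59000 ms, the worst-case stall multiplies per unreachable locator. A quick, illustrative sketch of the arithmetic (this is not Geode API, just the back-of-the-envelope math behind "minutes"):

```java
// Illustrative arithmetic only; LocatorTimeoutMath is not a Geode class.
public class LocatorTimeoutMath {

    // Mirrors the PoolFactory default quoted above.
    static final int DEFAULT_SOCKET_CONNECT_TIMEOUT = 59_000;

    // Worst-case extra latency in milliseconds when the client walks
    // through n locators that were shut down before finding a live one.
    static long worstCaseMillis(int unreachableLocators) {
        return (long) unreachableLocators * DEFAULT_SOCKET_CONNECT_TIMEOUT;
    }

    public static void main(String[] args) {
        // In the reproduction below, 3 of the 4 locators were shut down.
        System.out.println("worst case: " + worstCaseMillis(3) / 1000 + " seconds");
        // prints "worst case: 177 seconds"
    }
}
```

If that delay is unacceptable, recent Geode versions let a client lower the bound via PoolFactory.setSocketConnectTimeout; check whether your version supports it.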
Here is a stack for such a timeout:
{noformat}
clientgemfire1_881/system.log: [info 2021/07/07 19:48:57.936 GMT clientgemfire1_host1_881 <queueTimer-pool1> tid=0x85] Communication with locator gemfire-cluster-locator-3.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.18.108:10334 failed
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
at org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
at org.apache.geode.distributed.internal.tcpserver.ClusterSocketCreatorImpl.connect(ClusterSocketCreatorImpl.java:96)
at org.apache.geode.distributed.internal.tcpserver.TcpClient.getServerVersion(TcpClient.java:246)
at org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:151)
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocatorUsingConnection(AutoConnectionSourceImpl.java:217)
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocator(AutoConnectionSourceImpl.java:207)
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryLocators(AutoConnectionSourceImpl.java:254)
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServersForQueue(AutoConnectionSourceImpl.java:188)
at org.apache.geode.cache.client.internal.QueueManagerImpl.findQueueServers(QueueManagerImpl.java:823)
at org.apache.geode.cache.client.internal.QueueManagerImpl.recoverRedundancy(QueueManagerImpl.java:640)
at org.apache.geode.cache.client.internal.QueueManagerImpl.access$700(QueueManagerImpl.java:77)
at org.apache.geode.cache.client.internal.QueueManagerImpl$RedundancySatisfierTask.run2(QueueManagerImpl.java:1449)
at org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1334)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
Then, 59 seconds later, there can be another timeout while the client tries to talk to a different locator that was shut down.
Notes on how to reproduce:
You need servers and at least two locators, plus a client connected to a locator.
This is the client's pool configuration from my example, where there are four locators to start with and all are running:
{noformat}
[info 2021/07/07 19:37:33.715 GMT <vm_0_thr_0_client1_host1_881> tid=0x2d] Creating pool named: pool with attributes
freeConnectionTimeout: 10000
idleTimeout: 5000
loadConditioningInterval: 5000
locators:
[gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.17.239:10334,
gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.22.157:10334,
gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.10.237:10334,
gemfire-cluster-locator-3.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.18.108:10334]
minConnections: 2
maxConnections: -1
multiuserAuthentication: false
pingInterval: 10000
poolName: pool
prSingleHopEnabled: true
readTimeout: 3600000
retryAttempts: -1
servers: []
serverGroup:
socketBufferSize: 32768
statisticInterval: -1
subscriptionAckInterval: 100
subscriptionEnabled: true
subscriptionMessageTrackingTimeout: 900000
subscriptionRedundancy: 1
{noformat}
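For reference, a roughly equivalent declarative client configuration is sketched below. This is an illustration, not taken from the test run: the attribute names assume a recent Geode client cache XML schema, and the locator hosts are abbreviated.

```xml
<!-- Hypothetical cache.xml sketch; names and schema version are
     assumptions, not copied from the reproduction. -->
<client-cache xmlns="http://geode.apache.org/schema/cache"
              version="1.0">
  <pool name="pool"
        min-connections="2"
        subscription-enabled="true"
        subscription-redundancy="1">
    <!-- One entry per locator; the client blocks for the full
         socket-connect timeout on each entry that was shut down. -->
    <locator host="gemfire-cluster-locator-0.gemfire-cluster-locator" port="10334"/>
    <locator host="gemfire-cluster-locator-1.gemfire-cluster-locator" port="10334"/>
  </pool>
</client-cache>
```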
After the client connects to a locator, that locator is shut down.
In this example, three locators were shut down, so the only one left was gemfire-cluster-locator-0.
Note that the membership view in gemfire-cluster-locator-0's system log shows it knows there is now only one locator:
{noformat}
[info 2021/07/07 19:38:04.420 GMT gemfire-cluster-locator-0 <Geode Membership View Creator> tid=0xc8] sending new view View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v2>:41000|16] members:
[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v2>:41000,
gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v3>:41000{lead},
gemfire-cluster-server-5(gemfire-cluster-server-5:1)<v4>:41000,
gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v6>:41000,
gemfire-cluster-server-4(gemfire-cluster-server-4:1)<v7>:41000,
gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v8>:41000,
gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v9>:41000]
shutdown: [gemfire-cluster-server-6(gemfire-cluster-server-6:1)<v5>:41000]
{noformat}
After an appropriate amount of time, the stuck-thread stack appears in the client's system.log.
If operations were in progress in that client, they time out after 59 seconds and try another locator; that one may also be a stopped locator, in which case they wait another 59 seconds.
This can repeat for each locator that was shut down, making an operation take minutes.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)