Hi Bruce, Anthony,

Just seeking your confirmation that this change is OK to backport to 1.7.0 as is. Or should I take care of anything else as well?

Thanks,
Vahram.

From: Vahram Aharonyan <[email protected]>
Sent: Friday, April 12, 2019 11:37 AM
To: [email protected]
Subject: RE: Stucked thread after network outage

Hi All,

Thanks for your feedback. I will proceed with backporting the fix to the version that we are using.

Best Regards,
Vahram.

From: Anthony Baker <[email protected]>
Sent: Friday, April 12, 2019 3:58 AM
To: [email protected]
Subject: Re: Stucked thread after network outage

Thanks for the bug report Vahram!

Anthony

On Apr 11, 2019, at 3:36 PM, Bruce Schuchardt <[email protected]> wrote:

https://github.com/apache/geode/pull/3449

On 4/11/19 3:28 PM, Bruce Schuchardt wrote:

I've reopened GEODE-3948 to address this Vahram. I'll have a pull request up shortly.

On 4/11/19 8:06 AM, Vahram Aharonyan wrote:

Hi All,

We have 2 VMs that are running Geode 1.7 servers – one server per VM. Along with the Geode server, each VM has one Geode 1.7 client. Hence we have 2 servers and 2 clients in the Geode cluster.
While doing validation, we introduced packet loss (~65%) on the first VM “A”, and after about 1 minute the client on VM “B” reported the following:

[warning 2019/04/11 16:20:27.502 AMT Collector-c0f1ee3e-366a-4ac3-8fda-60540cdd21c4 <ThreadsMonitor> tid=0x1c] Thread <2182> that was executed at <11 Apr 2019 16:19:11 AMT> has been stuck for <76.204 seconds> and number of thread monitor iteration <1>
Thread Name <poolTimer-CollectorControllerPool-142>
Thread state <RUNNABLE>
Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread Stack:
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:171)
java.net.SocketInputStream.read(SocketInputStream.java:141)
sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
sun.security.ssl.InputRecord.read(InputRecord.java:503)
sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:809)
org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:659)
org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1124)
org.apache.geode.internal.cache.tier.sockets.Message.receive(Message.java:1135)
org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:205)
org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:386)
org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:276)
org.apache.geode.cache.client.internal.QueueConnectionImpl.execute(QueueConnectionImpl.java:167)
org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:894)
org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:387)
org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:349)
org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:827)
org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1338)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:271)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

This report and stack trace are repeated continuously by ThreadsMonitor over time; only the iteration count and “stuck for” values increase.

From the stack trace, this appears to be a ping operation (PingOp) initiated by the client on VM “B” against the server on VM “A”. Due to packet loss between the nodes, the server's response does not reach the calling client, and this thread remains blocked for hours. In the source I see that receiveWithHeaderReadTimeout receives NO_HEADER_READ_TIMEOUT as its timeout argument, which means we will wait indefinitely. Is this reasonable?

So the question is: why is the ping operation executed without a timeout? Or could it be that this stuck thread will be interrupted by some monitoring logic at some point?

Thanks,
Vahram.
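
To illustrate the indefinite wait described above, here is a minimal Java sketch. It is not Geode's code and not the fix from the pull request; the class name ReadTimeoutSketch and the method readTimesOut are hypothetical. It uses only the standard java.net.Socket.setSoTimeout call to show that a blocking socket read against a peer that never answers can be bounded by a read timeout, whereas a timeout of 0 would block forever.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

    /**
     * Connects to a local server socket that never accepts or replies
     * (a stand-in for a peer whose responses are lost), then attempts a
     * blocking read with the given SO_TIMEOUT in milliseconds.
     * Returns true if the read was abandoned with SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 50, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            // A value of 0 means "no timeout": read() would block indefinitely.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // no data ever arrives from the silent server
                return false;
            } catch (SocketTimeoutException expected) {
                return true; // the read gave up instead of staying stuck
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

With a positive timeout, the calling thread gets an exception it can handle (for example by treating the server as unreachable) instead of sitting in socketRead0 until the connection is torn down.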
