Thanks for the bug report, Vahram!

Anthony
> On Apr 11, 2019, at 3:36 PM, Bruce Schuchardt <[email protected]> wrote:
>
> https://github.com/apache/geode/pull/3449
>
> On 4/11/19 3:28 PM, Bruce Schuchardt wrote:
>> I've reopened GEODE-3948 to address this, Vahram. I'll have a pull request
>> up shortly.
>>
>> On 4/11/19 8:06 AM, Vahram Aharonyan wrote:
>>> Hi All,
>>>
>>> We have 2 VMs, each running one Geode 1.7 server. Alongside its server,
>>> each VM also hosts one Geode 1.7 client, so the cluster has 2 servers
>>> and 2 clients in total.
>>>
>>> While doing validation, we introduced packet loss (~65%) on the first
>>> VM, "A", and after about 1 minute the client on VM "B" reported the
>>> following:
>>>
>>> [warning 2019/04/11 16:20:27.502 AMT
>>> Collector-c0f1ee3e-366a-4ac3-8fda-60540cdd21c4 <ThreadsMonitor> tid=0x1c]
>>> Thread <2182> that was executed at <11 Apr 2019 16:19:11 AMT> has been
>>> stuck for <76.204 seconds> and number of thread monitor iteration <1>
>>> Thread Name <poolTimer-CollectorControllerPool-142>
>>> Thread state <RUNNABLE>
>>> Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
>>> Monitored metric <ResourceManagerStats.numThreadsStuck>
>>> Thread Stack:
>>> java.net.SocketInputStream.socketRead0(Native Method)
>>> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>>> java.net.SocketInputStream.read(SocketInputStream.java:171)
>>> java.net.SocketInputStream.read(SocketInputStream.java:141)
>>> sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
>>> sun.security.ssl.InputRecord.read(InputRecord.java:503)
>>> sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
>>> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
>>> sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
>>> org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:809)
>>> org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:659)
>>> org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1124)
>>> org.apache.geode.internal.cache.tier.sockets.Message.receive(Message.java:1135)
>>> org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:205)
>>> org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:386)
>>> org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:276)
>>> org.apache.geode.cache.client.internal.QueueConnectionImpl.execute(QueueConnectionImpl.java:167)
>>> org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:894)
>>> org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:387)
>>> org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:349)
>>> org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:827)
>>> org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
>>> org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
>>> org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1338)
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>> org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:271)
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> java.lang.Thread.run(Thread.java:748)
>>>
>>> This report and stack trace are repeated continuously by ThreadsMonitor
>>> over time – only the iteration count and "stuck for" values increase.
>>>
>>> From the stack trace this appears to be a PingOp initiated by the client
>>> on VM "B" against the server on VM "A". Because of the packet drop
>>> between the nodes, the server's response never reaches the calling
>>> client, and this thread remains blocked for hours. In the source I see
>>> that receiveWithHeaderReadTimeout receives NO_HEADER_READ_TIMEOUT as its
>>> timeout argument, which means we will wait indefinitely. Is this
>>> reasonable? So the question is: why is the ping operation executed
>>> without a timeout?
>>>
>>> Or could it be that this stuck thread will be interrupted by some
>>> monitoring logic at some moment?
>>>
>>> Thanks,
>>> Vahram.
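For anyone following along: the indefinite block Vahram describes comes down to standard java.net.Socket semantics, where the default SO_TIMEOUT of 0 makes read() wait forever for data that never arrives. Below is a minimal, self-contained sketch (not Geode code; the class name and port setup are made up for illustration) showing how a nonzero read timeout turns that indefinite block into a SocketTimeoutException the caller can handle:

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // A local "server" that accepts a connection but never replies,
        // standing in for a peer whose responses are lost to packet drops.
        try (ServerSocket server = new ServerSocket(0)) {
            Thread silentPeer = new Thread(() -> {
                try (Socket s = server.accept()) {
                    Thread.sleep(5_000); // hold the connection open, write nothing
                } catch (Exception ignored) {
                }
            });
            silentPeer.setDaemon(true);
            silentPeer.start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                client.setSoTimeout(1_000); // read timeout in milliseconds
                InputStream in = client.getInputStream();
                try {
                    // With the default SO_TIMEOUT of 0 this read would block forever,
                    // exactly like the stuck SocketInputStream.read frame in the trace.
                    in.read();
                    System.out.println("got data");
                } catch (SocketTimeoutException e) {
                    System.out.println("read timed out");
                }
            }
        }
    }
}
```

Whether Geode should use such a timeout for the ping path is exactly the design question raised above; GEODE-3948 and PR 3449 are where that decision landed.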
