Hi Bruce, Anthony,

Just seeking your confirmation that this change is OK to backport to 1.7.0 as is, 
or whether I should take care of anything else as well.

Thanks,
Vahram.

From: Vahram Aharonyan <[email protected]>
Sent: Friday, April 12, 2019 11:37 AM
To: [email protected]
Subject: RE: Stucked thread after network outage

Hi All,

Thanks for your feedback.
I will proceed with backporting the fix to the version that we are using.

Best Regards,
Vahram.

From: Anthony Baker <[email protected]>
Sent: Friday, April 12, 2019 3:58 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Stucked thread after network outage

Thanks for the bug report Vahram!

Anthony


On Apr 11, 2019, at 3:36 PM, Bruce Schuchardt <[email protected]> wrote:

https://github.com/apache/geode/pull/3449
On 4/11/19 3:28 PM, Bruce Schuchardt wrote:
I've reopened GEODE-3948 to address this, Vahram. I'll have a pull request up 
shortly.
On 4/11/19 8:06 AM, Vahram Aharonyan wrote:
Hi All,

We have two VMs running Geode 1.7 servers, one server per VM. Along with the 
Geode server, each VM also hosts one Geode 1.7 client, so we have two servers 
and two clients in the Geode cluster.
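
For reference, the client pool on each VM is set up roughly along the lines of 
the sketch below. This is only a minimal illustration: the host names and port 
are placeholders rather than our actual configuration, and the pool name is 
taken from the poolTimer thread name in the warning further down.

  import org.apache.geode.cache.client.ClientCache;
  import org.apache.geode.cache.client.ClientCacheFactory;
  import org.apache.geode.cache.client.Pool;
  import org.apache.geode.cache.client.PoolManager;

  public class ClientPoolSketch {
    public static void main(String[] args) {
      // Client cache plus one pool pointing at both cache servers; the pool's
      // poolTimer threads run the LiveServerPinger ping tasks against each endpoint.
      ClientCache cache = new ClientCacheFactory().create();
      Pool pool = PoolManager.createFactory()
          .addServer("vm-a.example", 40404)   // placeholder endpoint, server on VM "A"
          .addServer("vm-b.example", 40404)   // placeholder endpoint, server on VM "B"
          .setReadTimeout(10000)              // read timeout (ms) for regular operations
          .create("CollectorControllerPool"); // name matching the poolTimer thread in the log
      System.out.println("created pool " + pool.getName());
      cache.close();
    }
  }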

While doing validation, we introduced packet loss (~65%) on the first VM, “A”. 
After about one minute, the client on VM “B” reported the following:

[warning 2019/04/11 16:20:27.502 AMT Collector-c0f1ee3e-366a-4ac3-8fda-60540cdd21c4 <ThreadsMonitor> tid=0x1c] Thread <2182> that was executed at <11 Apr 2019 16:19:11 AMT> has been stuck for <76.204 seconds> and number of thread monitor iteration <1>
  Thread Name <poolTimer-CollectorControllerPool-142>
  Thread state <RUNNABLE>
  Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
  Monitored metric <ResourceManagerStats.numThreadsStuck>
  Thread Stack:
  java.net.SocketInputStream.socketRead0(Native Method)
  java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
  java.net.SocketInputStream.read(SocketInputStream.java:171)
  java.net.SocketInputStream.read(SocketInputStream.java:141)
  sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
  sun.security.ssl.InputRecord.read(InputRecord.java:503)
  sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
  sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
  sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
  org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:809)
  org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:659)
  org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1124)
  org.apache.geode.internal.cache.tier.sockets.Message.receive(Message.java:1135)
  org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:205)
  org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:386)
  org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:276)
  org.apache.geode.cache.client.internal.QueueConnectionImpl.execute(QueueConnectionImpl.java:167)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:894)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:387)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:349)
  org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:827)
  org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
  org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
  org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1338)
  java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
  org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:271)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  java.lang.Thread.run(Thread.java:748)

This report and stack trace are continuously repeated by ThreadsMonitor over 
time; only the iteration count and “stuck for” values increase. From the stack 
trace it appears to be a ping operation (PingOp) initiated by the client on VM 
“B” to the server on VM “A”. Due to the packet drop between the nodes, the 
response never reaches the calling client, and this thread remains blocked for 
hours. In the source I see that receiveWithHeaderReadTimeout receives 
NO_HEADER_READ_TIMEOUT as its timeout argument, which means we will wait 
indefinitely. Is this reasonable? So the question is: why is the ping operation 
executed without a timeout?
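
To illustrate what I mean, here is a minimal sketch using a plain java.net.Socket 
rather than Geode's code; the endpoint, buffer size, and 30-second timeout are 
placeholders. With SO_TIMEOUT left at its default of 0 the read blocks forever 
when packets are dropped, whereas a non-zero SO_TIMEOUT bounds the wait and 
raises a SocketTimeoutException that the caller could handle by closing or 
retrying the connection.

  import java.io.InputStream;
  import java.net.Socket;
  import java.net.SocketTimeoutException;

  public class ReadTimeoutSketch {
    public static void main(String[] args) throws Exception {
      try (Socket socket = new Socket("vm-a.example", 40404)) { // placeholder endpoint
        // With the default SO_TIMEOUT of 0, read() below blocks indefinitely when the
        // peer stops responding -- the behavior implied by NO_HEADER_READ_TIMEOUT.
        socket.setSoTimeout(30000); // bound the wait to 30 s for illustration
        InputStream in = socket.getInputStream();
        byte[] buf = new byte[1024];
        try {
          int n = in.read(buf); // returns, or throws after 30 s of silence
          System.out.println("read " + n + " bytes");
        } catch (SocketTimeoutException e) {
          System.out.println("no response within timeout; close or retry the connection");
        }
      }
    }
  }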

Or could it be that this stuck thread will be interrupted by some monitoring 
logic at some point?

Thanks,
Vahram.
