Thanks for the bug report, Vahram!

Anthony
> On Apr 11, 2019, at 3:36 PM, Bruce Schuchardt <[email protected]> wrote:
>
> https://github.com/apache/geode/pull/3449
>
> On 4/11/19 3:28 PM, Bruce Schuchardt wrote:
>> I've reopened GEODE-3948 to address this, Vahram. I'll have a pull request
>> up shortly.
>>
>> On 4/11/19 8:06 AM, Vahram Aharonyan wrote:
>>> Hi All,
>>>
>>> We have 2 VMs, each running one Geode 1.7 server. Alongside its server,
>>> each VM also hosts one Geode 1.7 client, so the cluster has 2 servers
>>> and 2 clients in total.
>>>
>>> While doing validation, we introduced packet loss (~65%) on the first
>>> VM, "A", and after about 1 minute the client on VM "B" reported the
>>> following:
>>>
>>> [warning 2019/04/11 16:20:27.502 AMT
>>> Collector-c0f1ee3e-366a-4ac3-8fda-60540cdd21c4 <ThreadsMonitor> tid=0x1c]
>>> Thread <2182> that was executed at <11 Apr 2019 16:19:11 AMT> has been
>>> stuck for <76.204 seconds> and number of thread monitor iteration <1>
>>> Thread Name <poolTimer-CollectorControllerPool-142>
>>> Thread state <RUNNABLE>
>>> Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
>>> Monitored metric <ResourceManagerStats.numThreadsStuck>
>>> Thread Stack:
>>> java.net.SocketInputStream.socketRead0(Native Method)
>>> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>>> java.net.SocketInputStream.read(SocketInputStream.java:171)
>>> java.net.SocketInputStream.read(SocketInputStream.java:141)
>>> sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
>>> sun.security.ssl.InputRecord.read(InputRecord.java:503)
>>> sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
>>> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
>>> sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
>>> org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:809)
>>> org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:659)
>>> org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1124)
>>> org.apache.geode.internal.cache.tier.sockets.Message.receive(Message.java:1135)
>>> org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:205)
>>> org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:386)
>>> org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:276)
>>> org.apache.geode.cache.client.internal.QueueConnectionImpl.execute(QueueConnectionImpl.java:167)
>>> org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:894)
>>> org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:387)
>>> org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:349)
>>> org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:827)
>>> org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
>>> org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
>>> org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1338)
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>> org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:271)
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> java.lang.Thread.run(Thread.java:748)
>>>
>>> This report and stack trace are repeated continuously by ThreadsMonitor
>>> over time – only the iteration count and "stuck for" values increase.
>>>
>>> From the stack trace this appears to be a PingOp initiated by the client
>>> on VM "B" against the server on VM "A". Because of the packet drop
>>> between the nodes, the server's response never reaches the calling
>>> client, and this thread remains blocked for hours. In the source I see
>>> that receiveWithHeaderReadTimeout receives NO_HEADER_READ_TIMEOUT as its
>>> timeout argument, which means we will wait indefinitely. Is this
>>> reasonable? So the question is: why is the ping operation executed
>>> without a timeout?
>>>
>>> Or could it be that this stuck thread will be interrupted by some
>>> monitoring logic at some moment?
>>>
>>> Thanks,
>>> Vahram.
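For anyone following along: the indefinite block Vahram describes comes down to standard java.net.Socket semantics, where the default SO_TIMEOUT of 0 makes read() wait forever for data that never arrives. Below is a minimal, self-contained sketch (not Geode code; the class name and port setup are made up for illustration) showing how a nonzero read timeout turns that indefinite block into a SocketTimeoutException the caller can handle:

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // A local "server" that accepts a connection but never replies,
        // standing in for a peer whose responses are lost to packet drops.
        try (ServerSocket server = new ServerSocket(0)) {
            Thread silentPeer = new Thread(() -> {
                try (Socket s = server.accept()) {
                    Thread.sleep(5_000); // hold the connection open, write nothing
                } catch (Exception ignored) {
                }
            });
            silentPeer.setDaemon(true);
            silentPeer.start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                client.setSoTimeout(1_000); // read timeout in milliseconds
                InputStream in = client.getInputStream();
                try {
                    // With the default SO_TIMEOUT of 0 this read would block forever,
                    // exactly like the stuck SocketInputStream.read frame in the trace.
                    in.read();
                    System.out.println("got data");
                } catch (SocketTimeoutException e) {
                    System.out.println("read timed out");
                }
            }
        }
    }
}
```

Whether Geode should use such a timeout for the ping path is exactly the design question raised above; GEODE-3948 and PR 3449 are where that decision landed.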
