[ 
https://issues.apache.org/jira/browse/GEODE-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526779#comment-16526779
 ] 

Dan Smith commented on GEODE-5358:
----------------------------------

Attaching a test that reproduces this issue. The test doesn't hang on every 
run, but it does hang fairly frequently.

[^GEODE-5358.diff]

> Interrupting a thread writing to a socket can result in a hang due to a lost 
> message
> ------------------------------------------------------------------------------------
>
>                 Key: GEODE-5358
>                 URL: https://issues.apache.org/jira/browse/GEODE-5358
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Dan Smith
>            Priority: Major
>         Attachments: GEODE-5358.diff
>
>
> If a thread doing a Geode operation is interrupted, it can result in the 
> system hanging while waiting for a reply. I have a dunit test that 
> demonstrates this issue by interrupting a thread during function execution. 
> The system is then stuck waiting for replies:
> {noformat}
>   [vm0] [warn 2018/06/28 11:14:13.715 PDT <Thread-264> tid=454] 15 seconds 
> have elapsed while waiting for replies: <FunctionStreamingResultCollector 
> 11084 waiting for 1 replies from [10.118.20.71(server-1:90978)<v6>:32771]> on 
> 10.118.20.71(server-0:90977)<v5>:32770 whose current membership list is: 
> [[10.118.20.71(server-1:90978)<v6>:32771, 
> 10.118.20.71(90975:locator)<ec><v0>:32769, 
> 10.118.20.71(server-0:90977)<v5>:32770]]
> "Thread-264" #454 daemon prio=5 os_prio=31 tid=0x00007fd30b9f8000 nid=0x8727 
> waiting on condition [0x000070000b300000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000007b8c10360> (a 
> java.util.concurrent.CountDownLatch$Sync)
>       at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>       at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>       at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:61)
>       at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:714)
>       at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:789)
>       at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:765)
>       at 
> org.apache.geode.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:139)
>       at 
> org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest.executeFunction(InterruptTcpConduitDUnitTest.java:91)
>       at 
> org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest.lambda$doInterruptTest$1(InterruptTcpConduitDUnitTest.java:67)
>       at 
> org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest$$Lambda$68/1495662507.run(Unknown
>  Source)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I think what is going on here is that there are two threads that write 
> messages to the same socket. If the second thread is interrupted, that causes 
> a ClosedByInterruptException and closes the socket. That can cause a message 
> from the first thread to be lost, because the socket is closed. The system 
> will then hang, because the sender keeps waiting for a reply to a message 
> that was never delivered.
> A suggested fix would be to implement a layer that can replay a certain 
> window of sent messages if a TCP connection between peers is lost and 
> reestablished.
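
The interrupt behaviour described above can be reproduced outside Geode with plain java.nio. The following is only a minimal sketch, not the attached dunit test: the class name InterruptClosesChannelDemo and its Outcome type are made up for illustration, and it uses a loopback connection in blocking mode. One thread writes to a shared SocketChannel with its interrupt status set, and a different thread then tries to write to the same channel.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo class; not part of Geode. */
public class InterruptClosesChannelDemo {

  /** What the two writers observed (illustrative helper type). */
  public static final class Outcome {
    public final Class<?> interruptedWriterError;
    public final Class<?> otherWriterError;
    public final boolean channelStillOpen;

    Outcome(Class<?> interruptedWriterError, Class<?> otherWriterError,
        boolean channelStillOpen) {
      this.interruptedWriterError = interruptedWriterError;
      this.otherWriterError = otherWriterError;
      this.channelStillOpen = channelStillOpen;
    }
  }

  public static Outcome run() {
    try (ServerSocketChannel server = ServerSocketChannel.open()) {
      server.bind(new InetSocketAddress("127.0.0.1", 0)); // any free port
      try (SocketChannel shared = SocketChannel.open(server.getLocalAddress());
          SocketChannel peer = server.accept()) {

        // "Second" thread: writes with its interrupt status already set.
        // For a blocking channel this closes the channel for everyone.
        AtomicReference<Class<?>> interruptedError = new AtomicReference<>();
        Thread second = new Thread(() -> {
          Thread.currentThread().interrupt();
          try {
            shared.write(ByteBuffer.wrap(new byte[] {2}));
          } catch (IOException e) {
            interruptedError.set(e.getClass());
          }
        });
        second.start();
        second.join();

        // "First" thread (here: the caller) was never interrupted, but its
        // message can no longer be written because the channel is closed.
        Class<?> otherError = null;
        try {
          shared.write(ByteBuffer.wrap(new byte[] {1}));
        } catch (IOException e) {
          otherError = e.getClass();
        }
        return new Outcome(interruptedError.get(), otherError, shared.isOpen());
      }
    } catch (IOException | InterruptedException e) {
      throw new IllegalStateException("demo setup failed", e);
    }
  }

  public static void main(String[] args) {
    Outcome o = run();
    System.out.println("interrupted writer saw: " + o.interruptedWriterError);
    System.out.println("other writer saw:       " + o.otherWriterError);
    System.out.println("channel still open:     " + o.channelStillOpen);
  }
}
```

In this sketch the uninterrupted writer at least sees an exception. The hang in the issue presumably arises when the lost message had already been handed to the socket, or when the failure is not propagated to the thread waiting for replies; that part is not modelled here.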

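The replay layer suggested in the issue description could be built around a bounded buffer of recently sent messages. The sketch below is one possible shape for that buffer, under several assumptions that do not come from the issue: messages carry a sender-assigned sequence number, the peer reports the last sequence number it received (via acknowledgements and again after reconnecting), and the oldest entry is simply dropped when the window is full. ReplayWindow and all of its method names are hypothetical, not Geode APIs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded window of sent-but-unacknowledged messages. */
public class ReplayWindow {

  private static final class Entry {
    final long seq;
    final byte[] payload;

    Entry(long seq, byte[] payload) {
      this.seq = seq;
      this.payload = payload;
    }
  }

  private final int capacity;
  private final Deque<Entry> window = new ArrayDeque<>();
  private long nextSeq = 1;

  public ReplayWindow(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Records a message before it is written; returns its sequence number. */
  public synchronized long record(byte[] payload) {
    if (window.size() == capacity) {
      // Assumption: a full window evicts the oldest entry, which can then
      // no longer be replayed. A real layer might block the sender instead.
      window.removeFirst();
    }
    long seq = nextSeq++;
    window.addLast(new Entry(seq, payload.clone()));
    return seq;
  }

  /** Drops every entry up to and including the acknowledged sequence number. */
  public synchronized void acknowledge(long seq) {
    while (!window.isEmpty() && window.peekFirst().seq <= seq) {
      window.removeFirst();
    }
  }

  /** Messages to resend after a reconnect, oldest first. */
  public synchronized List<byte[]> pendingAfter(long lastReceivedSeq) {
    List<byte[]> pending = new ArrayList<>();
    for (Entry e : window) {
      if (e.seq > lastReceivedSeq) {
        pending.add(e.payload.clone());
      }
    }
    return pending;
  }
}
```

On reconnect, the sender would ask the peer for the last sequence number it received and rewrite everything returned by pendingAfter before sending new messages. The receiver would also need to discard duplicates by sequence number, which this sketch leaves out.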


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
