[ https://issues.apache.org/jira/browse/GEODE-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526779#comment-16526779 ]
Dan Smith commented on GEODE-5358:
----------------------------------

Attaching a test that reproduces this issue. The test doesn't hang every single time, but it does hang fairly frequently. [^GEODE-5358.diff]

> Interrupting a thread writing to a socket can result in a hang due to a lost message
> ------------------------------------------------------------------------------------
>
>                 Key: GEODE-5358
>                 URL: https://issues.apache.org/jira/browse/GEODE-5358
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Dan Smith
>            Priority: Major
>         Attachments: GEODE-5358.diff
>
>
> If a thread doing a Geode operation is interrupted, it can result in the system hanging while waiting for a reply. I have a dunit test that demonstrates this issue; it interrupts a thread while we are doing function execution. The system is then stuck waiting for replies:
> {noformat}
> [vm0] [warn 2018/06/28 11:14:13.715 PDT <Thread-264> tid=454] 15 seconds have elapsed while waiting for replies: <FunctionStreamingResultCollector 11084 waiting for 1 replies from [10.118.20.71(server-1:90978)<v6>:32771]> on 10.118.20.71(server-0:90977)<v5>:32770 whose current membership list is: [[10.118.20.71(server-1:90978)<v6>:32771, 10.118.20.71(90975:locator)<ec><v0>:32769, 10.118.20.71(server-0:90977)<v5>:32770]]
> "Thread-264" #454 daemon prio=5 os_prio=31 tid=0x00007fd30b9f8000 nid=0x8727 waiting on condition [0x000070000b300000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for <0x00000007b8c10360> (a java.util.concurrent.CountDownLatch$Sync)
>       at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>       at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>       at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:61)
>       at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:714)
>       at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:789)
>       at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:765)
>       at org.apache.geode.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:139)
>       at org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest.executeFunction(InterruptTcpConduitDUnitTest.java:91)
>       at org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest.lambda$doInterruptTest$1(InterruptTcpConduitDUnitTest.java:67)
>       at org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest$$Lambda$68/1495662507.run(Unknown Source)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I think what is going on here is that two threads write messages to the same socket. If the second thread is interrupted, that causes a ClosedByInterruptException and closes the socket. Because the socket is now closed, a message from the first thread can be lost. The system will then hang.
> A suggested fix would be to implement a layer that can replay a certain window of sent messages if a TCP connection between peers is lost and reestablished.
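
The interrupt behaviour suspected in the description can be checked in isolation. The following is a minimal sketch, not Geode code: it assumes a java.nio Pipe is an acceptable stand-in for the shared peer-to-peer socket, and the class name InterruptClosesChannelDemo and its methods are made up for illustration. One thread writes with its interrupt status already set, which java.nio treats the same as an interrupt arriving during a blocking write, and a second writer then tries to use the same channel.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.Pipe;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Geode.
class InterruptClosesChannelDemo {

  private static ByteBuffer message(String text) {
    return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
  }

  /** Returns {outcome of the interrupted writer, outcome of the other writer}. */
  static String[] run() throws IOException, InterruptedException {
    Pipe pipe = Pipe.open();
    Pipe.SinkChannel shared = pipe.sink(); // stands in for the shared socket
    String[] outcome = new String[2];

    Thread interruptedWriter = new Thread(() -> {
      // Set the interrupt status before the blocking write so the result
      // does not depend on timing.
      Thread.currentThread().interrupt();
      try {
        shared.write(message("from interrupted thread"));
        outcome[0] = "written";
      } catch (ClosedByInterruptException e) {
        outcome[0] = "ClosedByInterruptException";
      } catch (IOException e) {
        outcome[0] = "other IOException";
      }
    });
    interruptedWriter.start();
    interruptedWriter.join();

    // The other writer never saw an interrupt, but the channel it shares
    // has been closed underneath it, so its message cannot be sent.
    try {
      shared.write(message("from other thread"));
      outcome[1] = "written";
    } catch (ClosedChannelException e) {
      outcome[1] = "lost: channel closed";
    } finally {
      pipe.source().close();
    }
    return outcome;
  }

  public static void main(String[] args) throws Exception {
    String[] outcome = run();
    System.out.println("interrupted writer: " + outcome[0]);
    System.out.println("other writer:       " + outcome[1]);
  }
}
```

In this sketch the second writer at least gets an exception. The hang in the issue would correspond to the case where that failure does not lead to the message being resent, so the receiver never replies. That part is specific to Geode's messaging layer and is not modelled here.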