Dan Smith created GEODE-5358:
--------------------------------

             Summary: Interrupting a thread writing to a socket can result in a hang due to a lost message
                 Key: GEODE-5358
                 URL: https://issues.apache.org/jira/browse/GEODE-5358
             Project: Geode
          Issue Type: Bug
          Components: messaging
            Reporter: Dan Smith

If a thread doing a Geode operation is interrupted, the system can hang waiting for a reply.

I have a dunit test that demonstrates this issue. It interrupts a thread while we are doing function execution. The system is then stuck waiting for replies:

{noformat}
[vm0] [warn 2018/06/28 11:14:13.715 PDT <Thread-264> tid=454] 15 seconds have elapsed while waiting for replies: <FunctionStreamingResultCollector 11084 waiting for 1 replies from [10.118.20.71(server-1:90978)<v6>:32771]> on 10.118.20.71(server-0:90977)<v5>:32770 whose current membership list is: [[10.118.20.71(server-1:90978)<v6>:32771, 10.118.20.71(90975:locator)<ec><v0>:32769, 10.118.20.71(server-0:90977)<v5>:32770]]

"Thread-264" #454 daemon prio=5 os_prio=31 tid=0x00007fd30b9f8000 nid=0x8727 waiting on condition [0x000070000b300000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000007b8c10360> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
	at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:61)
	at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:714)
	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:789)
	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:765)
	at org.apache.geode.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:139)
	at org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest.executeFunction(InterruptTcpConduitDUnitTest.java:91)
	at org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest.lambda$doInterruptTest$1(InterruptTcpConduitDUnitTest.java:67)
	at org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest$$Lambda$68/1495662507.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:748)
{noformat}

I think what is going on here is that two threads write messages to the same socket. If the second thread is interrupted, its write fails with a ClosedByInterruptException, which also closes the socket. A message from the first thread can then be lost because the socket is closed. The first thread never receives a reply to the lost message, so the system hangs.

A suggested fix would be to implement a layer that can replay a certain window of sent messages if a TCP connection between peers is lost and reestablished.
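
The suspected mechanism can be shown in isolation with plain JDK NIO, without any Geode code. The following is only a minimal sketch under some assumptions: it uses a java.nio.channels.Pipe as a stand-in for the shared peer-to-peer socket (both are interruptible channels), and the class name InterruptedWriterDemo and its Result fields are hypothetical names made up for this illustration. It is not the dunit test mentioned above and does not reproduce the hang itself, only the "interrupt closes the shared channel, the other writer's message is lost" step.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.Pipe;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Geode.
final class InterruptedWriterDemo {

  /** Observations recorded by run(). */
  static final class Result {
    volatile boolean interruptedWriterSawClosedByInterrupt;
    volatile boolean channelOpenAfterInterrupt;
    volatile boolean otherWriterMessageLost;
  }

  private static ByteBuffer bytes(String s) {
    return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
  }

  static Result run() throws IOException, InterruptedException {
    Result result = new Result();
    Pipe pipe = Pipe.open();
    // Assumption: the pipe's sink stands in for the socket shared by both writers.
    Pipe.SinkChannel shared = pipe.sink();

    // "Second" writer: its interrupt status is set before it does blocking I/O.
    Thread second = new Thread(() -> {
      Thread.currentThread().interrupt();
      try {
        shared.write(bytes("message from second thread"));
      } catch (ClosedByInterruptException e) {
        // The JDK closed the channel on behalf of the interrupted thread.
        result.interruptedWriterSawClosedByInterrupt = true;
      } catch (IOException e) {
        // Any other failure leaves the flag false.
      }
    });
    second.start();
    second.join();

    result.channelOpenAfterInterrupt = shared.isOpen();

    // "First" writer (this thread, never interrupted) now tries to send.
    try {
      shared.write(bytes("message from first thread"));
    } catch (ClosedChannelException e) {
      // The message cannot be delivered: the shared channel is already closed.
      result.otherWriterMessageLost = true;
    }

    pipe.source().close();
    return result;
  }

  public static void main(String[] args) throws Exception {
    Result r = run();
    System.out.println("second writer got ClosedByInterruptException: "
        + r.interruptedWriterSawClosedByInterrupt);
    System.out.println("channel still open: " + r.channelOpenAfterInterrupt);
    System.out.println("first writer's message lost: " + r.otherWriterMessageLost);
  }
}
```

In this sketch the uninterrupted writer at least sees an exception. In the scenario described in this issue, the first thread's message may already have been handed to the socket when the close happens, so the sender gets no error and simply waits for a reply that never comes.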