Dan Smith created GEODE-5358:
--------------------------------

             Summary: Interrupting a thread writing to a socket can result in a 
hang due to a lost message
                 Key: GEODE-5358
                 URL: https://issues.apache.org/jira/browse/GEODE-5358
             Project: Geode
          Issue Type: Bug
          Components: messaging
            Reporter: Dan Smith


If a thread performing a Geode operation is interrupted, the system can hang 
waiting for a reply. I have a dunit test that demonstrates this issue by 
interrupting a thread during function execution. The system is then stuck 
waiting for replies:
{noformat}
  [vm0] [warn 2018/06/28 11:14:13.715 PDT <Thread-264> tid=454] 15 seconds have 
elapsed while waiting for replies: <FunctionStreamingResultCollector 11084 
waiting for 1 replies from [10.118.20.71(server-1:90978)<v6>:32771]> on 
10.118.20.71(server-0:90977)<v5>:32770 whose current membership list is: 
[[10.118.20.71(server-1:90978)<v6>:32771, 
10.118.20.71(90975:locator)<ec><v0>:32769, 
10.118.20.71(server-0:90977)<v5>:32770]]

"Thread-264" #454 daemon prio=5 os_prio=31 tid=0x00007fd30b9f8000 nid=0x8727 
waiting on condition [0x000070000b300000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007b8c10360> (a 
java.util.concurrent.CountDownLatch$Sync)
        at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at 
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:61)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:714)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:789)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:765)
        at 
org.apache.geode.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:139)
        at 
org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest.executeFunction(InterruptTcpConduitDUnitTest.java:91)
        at 
org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest.lambda$doInterruptTest$1(InterruptTcpConduitDUnitTest.java:67)
        at 
org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest$$Lambda$68/1495662507.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748)
{noformat}

I think what is going on here is that there are two threads that write messages 
to the same socket. If the second thread is interrupted, that causes a 
ClosedByInterruptException and closes the socket. A message from the first 
thread can then be lost because the socket is closed, so the reply that 
message would have triggered never arrives. The system will then hang.
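
The underlying JDK behavior can be reproduced outside of Geode. The sketch 
below is not Geode code; the class name InterruptClosesChannelDemo and the 
loopback setup are made up for illustration. It relies only on the documented 
java.nio rule that a blocking I/O call on an interruptible channel closes the 
channel when the calling thread is interrupted. For simplicity a single thread 
plays both writers: it writes once with its interrupt status set, then clears 
the status and writes again the way an unrelated writer would.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class InterruptClosesChannelDemo {

    private static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns {interruptedWriteFailed, channelClosed, laterWriteFailed}. */
    static boolean[] run() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Bind to an ephemeral loopback port so the demo needs no configuration.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel shared = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = server.accept()) {

                // "Second thread": its interrupt status is set when it writes.
                boolean interruptedWriteFailed = false;
                Thread.currentThread().interrupt();
                try {
                    shared.write(bytes("from the interrupted writer"));
                } catch (ClosedByInterruptException e) {
                    interruptedWriteFailed = true;
                } finally {
                    Thread.interrupted(); // clear the status so later calls are unaffected
                }

                // The interrupt closed the whole channel, not just that one write.
                boolean channelClosed = !shared.isOpen();

                // "First thread": an unrelated writer sharing the channel now fails too.
                boolean laterWriteFailed = false;
                try {
                    shared.write(bytes("from the other writer"));
                } catch (ClosedChannelException e) {
                    laterWriteFailed = true;
                }
                return new boolean[] {interruptedWriteFailed, channelClosed, laterWriteFailed};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("interrupted write failed: " + r[0]);
        System.out.println("channel closed:           " + r[1]);
        System.out.println("later write failed:       " + r[2]);
    }
}
```

In this sketch the other writer at least sees an exception. The case described 
above is worse: a message that was already handed to the socket before the 
close can be dropped without the sender noticing.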

A suggested fix would be to implement a layer that can replay a certain window 
of sent messages if a TCP connection between peers is lost and reestablished.
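
As a rough illustration of what such a layer might track, here is a minimal 
sketch of a bounded replay window. None of this exists in Geode: the class 
ReplayWindow, its methods, and the per-connection sequence numbers are all 
hypothetical. It assumes the receiver can report the highest sequence number 
it has seen, both as a periodic acknowledgement and during reconnect.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sender-side buffer of messages that have not been acknowledged yet. */
public class ReplayWindow {

    private static final class Entry {
        final long seq;
        final byte[] payload;

        Entry(long seq, byte[] payload) {
            this.seq = seq;
            this.payload = payload;
        }
    }

    private final int capacity;
    private final Deque<Entry> unacked = new ArrayDeque<>();
    private long nextSeq = 1;

    public ReplayWindow(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Records a message before it is written and returns its sequence number. */
    public synchronized long record(byte[] payload) {
        if (unacked.size() == capacity) {
            // A real layer would have to block or apply back-pressure here.
            throw new IllegalStateException("replay window is full");
        }
        long seq = nextSeq++;
        unacked.addLast(new Entry(seq, payload.clone()));
        return seq;
    }

    /** Drops every message with a sequence number at or below the acknowledged one. */
    public synchronized void ack(long upToSeq) {
        while (!unacked.isEmpty() && unacked.peekFirst().seq <= upToSeq) {
            unacked.removeFirst();
        }
    }

    /** Messages to resend after a reconnect, given the last sequence the peer received. */
    public synchronized List<byte[]> pendingAfter(long lastReceivedSeq) {
        List<byte[]> result = new ArrayList<>();
        for (Entry e : unacked) {
            if (e.seq > lastReceivedSeq) {
                result.add(e.payload.clone());
            }
        }
        return result;
    }

    /** Number of messages currently held for possible replay. */
    public synchronized int size() {
        return unacked.size();
    }
}
```

On reconnect the sender would ask the peer for its last received sequence 
number and resend pendingAfter(thatNumber) in order before sending anything 
new. The receiver would also need to discard duplicates by sequence number, 
which this sketch does not cover.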



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
