Ufuk Celebi created FLINK-2134:
----------------------------------

             Summary: Deadlock in SuccessAfterNetworkBuffersFailureITCase
                 Key: FLINK-2134
                 URL: https://issues.apache.org/jira/browse/FLINK-2134
             Project: Flink
          Issue Type: Bug
    Affects Versions: master
            Reporter: Ufuk Celebi


I ran into this issue in a Travis run for a PR:
https://s3.amazonaws.com/archive.travis-ci.org/jobs/64994288/log.txt

I can reproduce this locally by running the test programs of 
SuccessAfterNetworkBuffersFailureITCase repeatedly on a single mini cluster:

{code}
cluster = new ForkableFlinkMiniCluster(config, false);
for (int i = 0; i < 100; i++) {
    // run the test programs: Connected Components, KMeans, Connected Components
}
{code}
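
To tell a hanging run apart from a merely slow one when looping like this, each run can be given a timeout. The following is only a minimal sketch of that idea in plain Java; the class {{RepeatUntilHang}} and the method {{firstHangingRun}} are hypothetical names and not part of Flink or its test utilities, and the job passed in stands for "run the test programs once".

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Flink: repeats a job until one run exceeds a timeout. */
public class RepeatUntilHang {

    /**
     * Runs the job up to 'runs' times, one run at a time.
     * Returns the zero-based index of the first run that did not finish
     * within 'timeoutMillis', or -1 if every run finished in time.
     */
    public static int firstHangingRun(Runnable job, int runs, long timeoutMillis) {
        // Daemon thread so a truly stuck job cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "repro-runner");
            t.setDaemon(true);
            return t;
        });
        try {
            for (int i = 0; i < runs; i++) {
                Future<?> result = executor.submit(job);
                try {
                    result.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // Suspected hang: take a thread dump (e.g. with jstack) at this point.
                    result.cancel(true);
                    return i;
                } catch (ExecutionException e) {
                    throw new IllegalStateException("run " + i + " failed", e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted during run " + i, e);
                }
            }
            return -1;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A timeout only flags a suspected hang; the thread dump taken at that point is what shows where the tasks are actually blocked.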

The iteration tasks are stuck waiting for superstep notifications, as in this 
thread dump excerpt:

{code}
"Join (Join at runConnectedComponents(SuccessAfterNetworkBuffersFailureITCase.java:128)) (8/6)" daemon prio=5 tid=0x00007f95f374f800 nid=0x138a7 in Object.wait() [0x0000000123f2a000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000007f89e3440> (a java.lang.Object)
        at org.apache.flink.runtime.iterative.concurrent.SuperstepKickoffLatch.awaitStartOfSuperstepOrTermination(SuperstepKickoffLatch.java:57)
        - locked <0x00000007f89e3440> (a java.lang.Object)
        at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:131)
        at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
        at java.lang.Thread.run(Thread.java:745)
{code}
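
For readers unfamiliar with this kind of wait: TIMED_WAITING (on object monitor) together with "waiting on" and "locked" for the same object is what a timed Object.wait() inside a synchronized block looks like in a dump. The sketch below illustrates that general pattern only. It is a simplified, hypothetical stand-in, not Flink's SuperstepKickoffLatch; the class name, the starting superstep number and the 2 second timeout are assumptions made for illustration.

```java
/** Hypothetical, simplified latch for illustration; not Flink's SuperstepKickoffLatch. */
public class SuperstepLatchSketch {

    private final Object monitor = new Object();
    private int superstep = 1;   // assumed: supersteps are counted from 1
    private boolean terminated;

    /** Called by the coordinating side to start the next superstep. */
    public void triggerNextSuperstep() {
        synchronized (monitor) {
            superstep++;
            monitor.notifyAll();
        }
    }

    /** Called by the coordinating side to end the iteration. */
    public void signalTermination() {
        synchronized (monitor) {
            terminated = true;
            monitor.notifyAll();
        }
    }

    /**
     * Blocks until the expected superstep has started or termination was signalled.
     * Returns true if the iteration terminated, false if the superstep started.
     */
    public boolean awaitStartOfSuperstepOrTermination(int expected) throws InterruptedException {
        synchronized (monitor) {
            while (!terminated && superstep < expected) {
                // Timed wait: a thread parked here is reported as TIMED_WAITING.
                monitor.wait(2000);
            }
            return terminated;
        }
    }

    /** Small demo: a second thread starts superstep 2 while this thread waits for it. */
    public static boolean demo() {
        SuperstepLatchSketch latch = new SuperstepLatchSketch();
        Thread head = new Thread(latch::triggerNextSuperstep, "head");
        head.start();
        try {
            boolean ended = latch.awaitStartOfSuperstepOrTermination(2);
            head.join();
            return !ended; // true: superstep 2 started without termination
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In a latch of this shape, a waiter that never receives either notification keeps looping through the timed wait indefinitely, so every thread dump shows it in the same place. Whether a missing notification is what happens here is not established by the dump alone.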

I asked [~rmetzger] to reproduce this, and it deadlocks for him as well. The 
deadlock shows up only after multiple runs, and the system needs to be under 
some load for it to occur.



