[ https://issues.apache.org/jira/browse/FLINK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094584#comment-16094584 ]

ASF GitHub Bot commented on FLINK-7216:
---------------------------------------

Github user aljoscha commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4364#discussion_r128494097
  
    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphRestartTest.java ---
    @@ -581,6 +565,106 @@ public void testSuspendWhileRestarting() throws Exception {
                assertEquals(JobStatus.SUSPENDED, eg.getState());
        }
     
    +   @Test
    +   public void testConcurrentLocalFailAndRestart() throws Exception {
    +           final ExecutionGraph eg = createSimpleTestGraph(new FixedDelayRestartStrategy(10, 0L));
    +           eg.setScheduleMode(ScheduleMode.EAGER);
    +           eg.scheduleForExecution();
    +
    +           waitUntilDeployedAndSwitchToRunning(eg, 1000);
    +
    +           final ExecutionJobVertex vertex = eg.getVerticesTopologically().iterator().next();
    +           final Execution first = vertex.getTaskVertices()[0].getCurrentExecutionAttempt();
    +           final Execution last = vertex.getTaskVertices()[vertex.getParallelism() - 1].getCurrentExecutionAttempt();
    +
    +           final OneShotLatch failTrigger = new OneShotLatch();
    +           final CountDownLatch readyLatch = new CountDownLatch(2);
    +
    +           Thread failure1 = new Thread() {
    +                   @Override
    +                   public void run() {
    +                           readyLatch.countDown();
    +                           try {
    +                                   failTrigger.await();
    +                           } catch (InterruptedException ignored) {}
    +
    +                           first.fail(new Exception("intended test failure 1"));
    +                   }
    +           };
    +
    +           Thread failure2 = new Thread() {
    +                   @Override
    +                   public void run() {
    +                           readyLatch.countDown();
    +                           try {
    +                                   failTrigger.await();
    +                           } catch (InterruptedException ignored) {}
    +
    +                           last.fail(new Exception("intended test failure 2"));
    +                   }
    +           };
    +
    +           // make sure both threads start simultaneously
    +           failure1.start();
    +           failure2.start();
    +           readyLatch.await();
    +           failTrigger.trigger();
    +
    +           waitUntilJobStatus(eg, JobStatus.FAILING, 1000);
    +           completeCancellingForAllVertices(eg);
    +
    +           waitUntilJobStatus(eg, JobStatus.RUNNING, 1000);
    +           waitUntilDeployedAndSwitchToRunning(eg, 1000);
    +           finishAllVertices(eg);
    +
    +           eg.waitUntilTerminal();
    +           assertEquals(JobStatus.FINISHED, eg.getState());
    +   }
    +
    +   @Test
    +   public void testConcurrentGlobalFailAndRestarts() throws Exception {
    --- End diff --
    
    I tried running this on current master and the test failed, but I didn't see
    a "storm of restarts".


> ExecutionGraph can perform concurrent global restarts to scheduling
> -------------------------------------------------------------------
>
>                 Key: FLINK-7216
>                 URL: https://issues.apache.org/jira/browse/FLINK-7216
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.2.1, 1.3.1
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Blocker
>             Fix For: 1.4.0, 1.3.2
>
>
> Because ExecutionGraph restarts happen asynchronously and possibly delayed,
> in rare corner cases two restarts can be attempted concurrently, in which
> case some structures of the ExecutionGraph are accessed concurrently.
> Sample stack trace:
> {code}
> WARN  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Failed to restart the job.
> java.lang.IllegalStateException: SlotSharingGroup cannot clear task assignment, group still has allocated resources.
>     at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroup.clearTaskAssignment(SlotSharingGroup.java:78)
>     at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:535)
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1151)
>     at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestarter$1.call(ExecutionGraphRestarter.java:40)
>     at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:95)
>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> The solution is to strictly guard against "subsumed" restarts via the 
> {{globalModVersion}} in a similar way as we fence local restarts against 
> global restarts.
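
The fencing idea in the description above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Flink's actual implementation: the class VersionFencedRestarter and its methods failGlobal, restart, and getRestartsPerformed are hypothetical names, and only the field name globalModVersion is taken from the issue text. Each global failure bumps the version and hands it to the restart it schedules; a restart that arrives with an outdated version has been subsumed by a newer failure and is dropped.

```java
// Hypothetical, simplified stand-in for version-fenced restarts.
// Not Flink's ExecutionGraph; names other than globalModVersion are assumptions.
class VersionFencedRestarter {

    // Incremented on every global failure; guarded by the instance lock.
    private long globalModVersion = 0L;

    // Counts restarts that actually ran (for illustration only).
    private int restartsPerformed = 0;

    /** Records a global failure and returns the version the scheduled restart must present. */
    synchronized long failGlobal() {
        globalModVersion++;
        return globalModVersion;
    }

    /**
     * Performs the restart only if no newer global failure happened since it was scheduled.
     * Returns false when the restart was subsumed and therefore ignored.
     */
    synchronized boolean restart(long expectedGlobalVersion) {
        if (expectedGlobalVersion != globalModVersion) {
            return false; // stale restart request, a newer failure owns the restart
        }
        restartsPerformed++; // the real reset and rescheduling work would go here
        return true;
    }

    synchronized int getRestartsPerformed() {
        return restartsPerformed;
    }
}
```

With two failures in quick succession, the restart scheduled by the first one is rejected and only the restart scheduled by the second one runs, so the shared structures are reset once rather than twice.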



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
