[ https://issues.apache.org/jira/browse/FLINK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642938#comment-16642938 ]
Till Rohrmann commented on FLINK-9788:
--------------------------------------

Looking at the code, I think the problem is that we don't cancel the newly created/reset {{Executions}}. {{ExecutionVertex:579}} would then always fail when trying to be reset. The problem seems to originate in the {{ExecutionGraph#failGlobal}} method where we don't cancel {{Executions}} if the {{ExecutionGraph}} is in state {{RESTARTING}}.

I think we could solve the problem by allowing the state transition {{RESTARTING --> FAILING}} by simply removing the {{RESTARTING}} branch in {{#failGlobal}}. That way, we would also cancel all newly created {{Executions}} before trying to restart.

> ExecutionGraph Inconsistency prevents Job from recovering
> ---------------------------------------------------------
>
>                 Key: FLINK-9788
>                 URL: https://issues.apache.org/jira/browse/FLINK-9788
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.6.0
>         Environment: Rev: 4a06160
> Hadoop 2.8.3
>            Reporter: Gary Yao
>            Priority: Blocker
>             Fix For: 1.7.0, 1.6.2
>
>         Attachments: jobmanager_5000.log
>
>
> Deployment mode: YARN job mode with HA
> After killing many TaskManagers in succession, the state of the
> ExecutionGraph ran into an inconsistent state, which prevented job recovery.
> The following stacktrace was logged in the JobManager log several hundred
> times per second:
> {noformat}
> -08 16:47:18,855 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job General purpose test job (37a794195840700b98feb23e99f7ea24) switched from state RESTARTING to RESTARTING.
> 2018-07-08 16:47:18,856 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job General purpose test job (37a794195840700b98feb23e99f7ea24).
> 2018-07-08 16:47:18,857 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph - Resetting execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for new execution.
> 2018-07-08 16:47:18,857 WARN org.apache.flink.runtime.executiongraph.ExecutionGraph - Failed to restart the job.
> java.lang.IllegalStateException: Cannot reset a vertex that is in non-terminal state CREATED
>     at org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
>     at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
>     at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
>     at org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The resulting jobmanager log file was 4.7 GB in size. Find attached the first
> 5000 lines of the log file.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
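
To make the proposal in the comment above concrete, here is a minimal toy model of the state handling it describes. It is a sketch only and not Flink's actual code: {{FailGlobalSketch}}, {{ToyGraph}}, {{ToyVertex}} and the {{keepRestartingBranch}} flag are hypothetical names invented for illustration, and the real {{ExecutionGraph#failGlobal}} is considerably more involved. The model assumes that a global failure arriving in {{RESTARTING}} either retries the restart directly (the branch proposed for removal) or first goes through {{FAILING}} and cancels every execution.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not Flink code) of the RESTARTING --> FAILING proposal. */
public class FailGlobalSketch {

    enum JobStatus { RUNNING, FAILING, RESTARTING }

    enum ExecutionState {
        CREATED, CANCELED;

        boolean isTerminal() {
            return this == CANCELED;
        }
    }

    /** Stand-in for an execution vertex; freshly reset vertices are in CREATED. */
    static final class ToyVertex {
        ExecutionState state = ExecutionState.CREATED;

        void cancel() {
            state = ExecutionState.CANCELED;
        }

        void resetForNewExecution() {
            // Mirrors the check that produced the exception in the reported log.
            if (!state.isTerminal()) {
                throw new IllegalStateException(
                        "Cannot reset a vertex that is in non-terminal state " + state);
            }
            state = ExecutionState.CREATED;
        }
    }

    /** Stand-in for the execution graph, starting out in RESTARTING. */
    static final class ToyGraph {
        final List<ToyVertex> vertices = new ArrayList<>();
        final boolean keepRestartingBranch;
        JobStatus status = JobStatus.RESTARTING;

        ToyGraph(int parallelism, boolean keepRestartingBranch) {
            this.keepRestartingBranch = keepRestartingBranch;
            for (int i = 0; i < parallelism; i++) {
                vertices.add(new ToyVertex());
            }
        }

        void failGlobal() {
            if (status == JobStatus.RESTARTING && keepRestartingBranch) {
                // Modelled old behaviour: retry the restart without cancelling anything.
                restart();
                return;
            }
            // Modelled proposal: allow RESTARTING --> FAILING and cancel all executions.
            status = JobStatus.FAILING;
            for (ToyVertex vertex : vertices) {
                vertex.cancel();
            }
            status = JobStatus.RESTARTING;
            restart();
        }

        void restart() {
            for (ToyVertex vertex : vertices) {
                vertex.resetForNewExecution();
            }
            status = JobStatus.RUNNING;
        }
    }

    /** Returns true if a global failure during RESTARTING ends with the job running again. */
    static boolean recovers(boolean keepRestartingBranch) {
        ToyGraph graph = new ToyGraph(3, keepRestartingBranch);
        try {
            graph.failGlobal();
            return graph.status == JobStatus.RUNNING;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("with RESTARTING branch:    recovers = " + recovers(true));
        System.out.println("without RESTARTING branch: recovers = " + recovers(false));
    }
}
```

In this model, keeping the branch leaves the vertices in {{CREATED}}, so every reset attempt throws the same {{IllegalStateException}} seen in the log, while removing it cancels them first and lets the reset go through. Whether the real fix needs additional handling (for example, waiting for cancellation to complete) is not covered by the sketch.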