Gary Yao created FLINK-9788:
-------------------------------
Summary: ExecutionGraph Inconsistency prevents Job from recovering
Key: FLINK-9788
URL: https://issues.apache.org/jira/browse/FLINK-9788
Project: Flink
Issue Type: Bug
Components: Core
Affects Versions: 1.6.0
Environment: Rev: 4a06160
Hadoop 2.8.3
Reporter: Gary Yao
Attachments: jobmanager_5000.log
Deployment mode: YARN job mode with HA
After killing many TaskManagers in succession, the state of the ExecutionGraph
ran into an inconsistent state, which prevented job recovery. The following
stacktrace was logged in the JobManager log several hundred times per second:
{noformat}
-08 16:47:18,855 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph
- Job General purpose test job (37a794195840700b98feb23e99f7ea24) switched
from state RESTARTING to RESTARTING.
2018-07-08 16:47:18,856 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the
job General purpose test job (37a794195840700b98feb23e99f7ea24).
2018-07-08 16:47:18,857 DEBUG
org.apache.flink.runtime.executiongraph.ExecutionGraph - Resetting
execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for new
execution.
2018-07-08 16:47:18,857 WARN
org.apache.flink.runtime.executiongraph.ExecutionGraph - Failed to
restart the job.
java.lang.IllegalStateException: Cannot reset a vertex that is in non-terminal
state CREATED
at
org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
at
org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
at
org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
at
org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
at
org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
The resulting jobmanager log file was 4.7 GB in size. Find attached the first
5000 lines of the log file.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)