[jira] [Commented] (FLINK-9788) ExecutionGraph Inconsistency prevents Job from recovering

Till Rohrmann (JIRA) Tue, 09 Oct 2018 01:13:13 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642938#comment-16642938
 ]


Till Rohrmann commented on FLINK-9788:
--------------------------------------

Looking at the code, I think the problem is that we don't cancel the newly 
created/reset {{Executions}}. The check at {{ExecutionVertex:579}} would then 
always fail when we try to reset the vertex again. The problem seems to 
originate in the {{ExecutionGraph#failGlobal}} method, where we don't cancel 
{{Executions}} if the {{ExecutionGraph}} is in state {{RESTARTING}}. I think we 
could solve the problem by allowing the state transition 
{{RESTARTING --> FAILING}}, simply by removing the {{RESTARTING}} branch in 
{{#failGlobal}}. That way, we would also cancel all newly created 
{{Executions}} before trying to restart.
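
To illustrate the idea, here is a minimal toy model of the state handling 
described above. It is a sketch only, not Flink's actual code: the classes 
{{FailGlobalSketch}}, {{Graph}} and {{Vertex}}, the flag 
{{keepRestartingBranch}} and the helper {{recovers}} are hypothetical names, 
and the early return in {{failGlobal}} is an assumed simplification of what 
the {{RESTARTING}} branch does.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the discussion above; these are NOT Flink's real classes. */
final class FailGlobalSketch {

    enum JobState { RUNNING, FAILING, RESTARTING }

    enum ExecState { CREATED, CANCELED }

    /** Hypothetical stand-in for an ExecutionVertex and its current Execution. */
    static final class Vertex {
        ExecState state = ExecState.CREATED;

        void cancel() {
            state = ExecState.CANCELED;
        }

        /** A vertex may only be reset once its execution is terminal. */
        void resetForNewExecution() {
            if (state != ExecState.CANCELED) {
                throw new IllegalStateException(
                        "Cannot reset a vertex that is in non-terminal state " + state);
            }
            state = ExecState.CREATED; // fresh execution for the next attempt
        }
    }

    static final class Graph {
        final List<Vertex> vertices = new ArrayList<>();
        final boolean keepRestartingBranch;
        // Start mid-restart: the executions were already reset and are CREATED.
        JobState state = JobState.RESTARTING;

        Graph(int parallelism, boolean keepRestartingBranch) {
            this.keepRestartingBranch = keepRestartingBranch;
            for (int i = 0; i < parallelism; i++) {
                vertices.add(new Vertex());
            }
        }

        void failGlobal() {
            if (keepRestartingBranch && state == JobState.RESTARTING) {
                // Assumed simplification: retry the restart, cancel nothing.
                return;
            }
            // With the branch removed, RESTARTING --> FAILING is allowed.
            state = JobState.FAILING;
            for (Vertex v : vertices) {
                v.cancel();
            }
            state = JobState.RESTARTING;
        }

        /** Returns true if every vertex could be reset for a new execution. */
        boolean restart() {
            try {
                for (Vertex v : vertices) {
                    v.resetForNewExecution();
                }
                state = JobState.RUNNING;
                return true;
            } catch (IllegalStateException e) {
                return false; // the real job would log this and try again
            }
        }
    }

    /** Fails the graph globally while RESTARTING, then attempts one restart. */
    static boolean recovers(boolean keepRestartingBranch) {
        Graph graph = new Graph(3, keepRestartingBranch);
        graph.failGlobal();
        return graph.restart();
    }

    public static void main(String[] args) {
        System.out.println("with RESTARTING branch:    " + recovers(true));
        System.out.println("without RESTARTING branch: " + recovers(false));
    }
}
```

In this model, {{recovers(true)}} returns {{false}} because the vertices are 
still {{CREATED}} when the restart tries to reset them, while 
{{recovers(false)}} returns {{true}} because the global failure cancels them 
first.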

> ExecutionGraph Inconsistency prevents Job from recovering
> ---------------------------------------------------------
>
>                 Key: FLINK-9788
>                 URL: https://issues.apache.org/jira/browse/FLINK-9788
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.6.0
>         Environment: Rev: 4a06160
> Hadoop 2.8.3
>            Reporter: Gary Yao
>            Priority: Blocker
>             Fix For: 1.7.0, 1.6.2
>
>         Attachments: jobmanager_5000.log
>
>
> Deployment mode: YARN job mode with HA
> After killing many TaskManagers in succession, the ExecutionGraph ended up in 
> an inconsistent state, which prevented job recovery. The following stack 
> trace was logged in the JobManager log several hundred times per second:
> {noformat}
> 2018-07-08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph 
>        - Job General purpose test job (37a794195840700b98feb23e99f7ea24) 
> switched from state RESTARTING to RESTARTING.
> 2018-07-08 16:47:18,856 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting 
> the job General purpose test job (37a794195840700b98feb23e99f7ea24).
> 2018-07-08 16:47:18,857 DEBUG 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Resetting 
> execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for 
> new execution.
> 2018-07-08 16:47:18,857 WARN  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Failed to 
> restart the job.
> java.lang.IllegalStateException: Cannot reset a vertex that is in 
> non-terminal state CREATED
>         at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
>         at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
>         at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
>         at 
> org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
>         at 
> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The resulting jobmanager log file was 4.7 GB in size. Find attached the first 
> 5000 lines of the log file. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
