[ 
https://issues.apache.org/jira/browse/FLINK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642744#comment-16642744
 ] 

Biao Liu commented on FLINK-9788:
---------------------------------

After checking the log file, I believe this is a critical bug. If the scenario 
below happens, the job never recovers, as the log shows.
1. A failover happens due to the loss of a TM, and ExecutionGraph tries to 
restart itself.
2. Restarter (1) is in ExecutionGraph.restart(), resetting all executions.
3. Another fatal error happens and triggers ExecutionGraph.failGlobal. Since 
the state of ExecutionGraph is RESTARTING, it increases the global version and 
tries to restart ExecutionGraph too.
4. Restarter (1) fails because the global version no longer matches. Some 
executions have been reset to CREATED, some have not.
5. As restarter (1) has failed, it triggers another failGlobal without 
changing the state of ExecutionGraph (RESTARTING).
6. The executions that were reset in step 4 then fail forever during 
restarting.
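
The interleaving in steps 1-6 can be reproduced with a small toy model. This is 
only a sketch under stated assumptions, not Flink's code: ToyGraph, 
VertexState, restart(expectedVersion, interruptAfter) and simulate are 
hypothetical names, the version check is assumed to happen before each vertex 
is reset, and the concurrent failGlobal is simulated by a counter instead of a 
second thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of steps 1-6. All names here are hypothetical, not Flink classes. */
class RestartRaceSketch {
    enum VertexState { CANCELED, CREATED }

    static final class ToyGraph {
        long globalVersion = 1;
        final VertexState[] vertices;

        ToyGraph(int vertexCount) {
            vertices = new VertexState[vertexCount];
            // After the first failover (step 1) all vertices are in a terminal state.
            Arrays.fill(vertices, VertexState.CANCELED);
        }

        /** Step 3 and step 5: bump the version; the graph stays RESTARTING, nothing is canceled. */
        void failGlobal() {
            globalVersion++;
        }

        /**
         * Steps 2 and 4: reset vertices one by one. interruptAfter is the index at which
         * a simulated concurrent failGlobal arrives (-1 means no interruption).
         * Returns false if the restart was aborted because the version changed.
         */
        boolean restart(long expectedVersion, int interruptAfter) {
            for (int i = 0; i < vertices.length; i++) {
                if (i == interruptAfter) {
                    failGlobal();
                }
                if (globalVersion != expectedVersion) {
                    return false; // aborted; vertices before index i are already CREATED
                }
                if (vertices[i] == VertexState.CREATED) {
                    throw new IllegalStateException(
                        "Cannot reset a vertex that is in non-terminal state CREATED");
                }
                vertices[i] = VertexState.CREATED;
            }
            return true;
        }
    }

    /** Runs the interrupted restart once, then the given number of retries. */
    static List<String> simulate(int retries) {
        ToyGraph graph = new ToyGraph(4);
        List<String> log = new ArrayList<>();

        // Restarter (1) captured version 1; failGlobal arrives after two vertices were reset.
        boolean ok = graph.restart(graph.globalVersion, 2);
        log.add(ok ? "restarted" : "aborted: version mismatch");

        for (int i = 0; i < retries; i++) {
            try {
                graph.restart(graph.globalVersion, -1);
                log.add("restarted");
                break;
            } catch (IllegalStateException e) {
                log.add("failed: " + e.getMessage());
                graph.failGlobal(); // step 5: triggers yet another restart
            }
        }
        return log;
    }

    public static void main(String[] args) {
        simulate(3).forEach(System.out::println);
    }
}
```

In this model the first attempt is aborted by the version mismatch, and every 
retry then fails on the first vertex that was already reset to CREATED, which 
matches the repeated exception in the log.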

I believe the problem is that the RESTARTING state of ExecutionGraph is not a 
safe state in which we can restart without any cancellation. Maybe we should 
perform a cancellation while the state is RESTARTING. 
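
One possible reading of this suggestion, as a minimal sketch: before resetting, 
force every vertex that is not in a terminal state into CANCELED, so that a 
half-finished reset from an earlier attempt cannot block the next one. The 
names below (CancelBeforeResetSketch, cancelNonTerminal, resetAll, 
restartWithCancel) are hypothetical and not part of Flink, and the sketch 
treats cancellation as synchronous, which real task cancellation is not.

```java
/** Toy sketch of "cancel before reset". All names are hypothetical, not Flink classes. */
class CancelBeforeResetSketch {
    enum VertexState {
        CREATED, RUNNING, CANCELED, FAILED;

        boolean isTerminal() {
            return this == CANCELED || this == FAILED;
        }
    }

    /** Forces every non-terminal vertex into CANCELED (synchronous in this toy model). */
    static void cancelNonTerminal(VertexState[] vertices) {
        for (int i = 0; i < vertices.length; i++) {
            if (!vertices[i].isTerminal()) {
                vertices[i] = VertexState.CANCELED;
            }
        }
    }

    /** Resets all vertices to CREATED; refuses to reset a non-terminal vertex. */
    static void resetAll(VertexState[] vertices) {
        for (int i = 0; i < vertices.length; i++) {
            if (!vertices[i].isTerminal()) {
                throw new IllegalStateException(
                    "Cannot reset a vertex that is in non-terminal state " + vertices[i]);
            }
            vertices[i] = VertexState.CREATED;
        }
    }

    /** Cancels first, then resets; works on a copy so the input stays unchanged. */
    static VertexState[] restartWithCancel(VertexState[] vertices) {
        VertexState[] copy = vertices.clone();
        cancelNonTerminal(copy);
        resetAll(copy);
        return copy;
    }
}
```

With a half-reset input such as CREATED, CREATED, CANCELED, FAILED, calling 
resetAll directly throws, while restartWithCancel brings all vertices back to 
CREATED.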

> ExecutionGraph Inconsistency prevents Job from recovering
> ---------------------------------------------------------
>
>                 Key: FLINK-9788
>                 URL: https://issues.apache.org/jira/browse/FLINK-9788
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.6.0
>         Environment: Rev: 4a06160
> Hadoop 2.8.3
>            Reporter: Gary Yao
>            Priority: Critical
>             Fix For: 1.7.0, 1.6.2
>
>         Attachments: jobmanager_5000.log
>
>
> Deployment mode: YARN job mode with HA
> After killing many TaskManagers in succession, the ExecutionGraph ran into 
> an inconsistent state, which prevented job recovery. 
> The following stacktrace was logged in the JobManager log several hundred 
> times per second:
> {noformat}
> -08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph 
>        - Job General purpose test job (37a794195840700b98feb23e99f7ea24) 
> switched from state RESTARTING to RESTARTING.
> 2018-07-08 16:47:18,856 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting 
> the job General purpose test job (37a794195840700b98feb23e99f7ea24).
> 2018-07-08 16:47:18,857 DEBUG 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Resetting 
> execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for 
> new execution.
> 2018-07-08 16:47:18,857 WARN  
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Failed to 
> restart the job.
> java.lang.IllegalStateException: Cannot reset a vertex that is in 
> non-terminal state CREATED
>         at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
>         at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
>         at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
>         at 
> org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
>         at 
> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The resulting jobmanager log file was 4.7 GB in size. Find attached the first 
> 5000 lines of the log file. 


