[jira] [Commented] (FLINK-18959) Fail to archiveExecutionGraph because job is not finished when dispatcher close

Liu (Jira) Mon, 24 Aug 2020 19:22:06 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-18959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183683#comment-17183683
 ]


Liu commented on FLINK-18959:
-----------------------------

Thanks for the reply, [~trohrmann] and [~aljoscha]. After reverting the code in 
MiniDispatcher, all tests run ok. But I also notice that something else should 
be changed. In MiniDispatcher's method jobReachedGloballyTerminalState, the 
cluster shuts down only if executionMode is DETACHED. Upon cancellation, the 
dispatcher will not shut down if executionMode is NORMAL. So we should shut 
down cluster no matter executionMode is DETACHED or NORMAL.

> Fail to archiveExecutionGraph because job is not finished when dispatcher 
> close
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-18959
>                 URL: https://issues.apache.org/jira/browse/FLINK-18959
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0, 1.12.0, 1.11.1
>            Reporter: Liu
>            Assignee: Liu
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.12.0, 1.11.2, 1.10.3
>
>         Attachments: flink-debug-log
>
>
> When job is cancelled, we expect to see it in flink's history server. But I 
> can not see my job after it is cancelled.
> After digging into the problem, I find that the function 
> archiveExecutionGraph is not executed. Below is the brief log:
> {panel:title=log}
> 2020-08-14 15:10:06,406 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph 
> [flink-akka.actor.default-dispatcher- 15] - Job EtlAndWindow 
> (6f784d4cc5bae88a332d254b21660372) switched from state RUNNING to CANCELLING.
> 2020-08-14 15:10:06,415 DEBUG 
> org.apache.flink.runtime.dispatcher.MiniDispatcher 
> [flink-akka.actor.default-dispatcher-3] - Shutting down per-job cluster 
> because the job was canceled.
> 2020-08-14 15:10:06,629 INFO 
> org.apache.flink.runtime.dispatcher.MiniDispatcher 
> [flink-akka.actor.default-dispatcher-3] - Stopping dispatcher 
> akka.tcp://flink@bjfk-c9865.yz02:38663/user/dispatcher.
> 2020-08-14 15:10:06,629 INFO 
> org.apache.flink.runtime.dispatcher.MiniDispatcher 
> [flink-akka.actor.default-dispatcher-3] - Stopping all currently running jobs 
> of dispatcher akka.tcp://flink@bjfk-c9865.yz02:38663/user/dispatcher.
> 2020-08-14 15:10:06,631 INFO org.apache.flink.runtime.jobmaster.JobMaster 
> [flink-akka.actor.default-dispatcher-29] - Stopping the JobMaster for job 
> EtlAndWindow(6f784d4cc5bae88a332d254b21660372).
> 2020-08-14 15:10:06,632 DEBUG org.apache.flink.runtime.jobmaster.JobMaster 
> [flink-akka.actor.default-dispatcher-29] - Disconnect TaskExecutor 
> container_e144_1590060720089_2161_01_000006 because: Stopping JobMaster for 
> job EtlAndWindow(6f784d4cc5bae88a332d254b21660372).
> 2020-08-14 15:10:06,646 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph 
> [flink-akka.actor.default-dispatcher-29] - Job EtlAndWindow 
> (6f784d4cc5bae88a332d254b21660372) switched from state CANCELLING to CANCELED.
> 2020-08-14 15:10:06,664 DEBUG 
> org.apache.flink.runtime.dispatcher.MiniDispatcher 
> [flink-akka.actor.default-dispatcher-4] - There is a newer JobManagerRunner 
> for the job 6f784d4cc5bae88a332d254b21660372.
> {panel}
> From the log, we can see that job is not finished when dispatcher closes. The 
> process is as following:
>  * Receive cancel command and send it to all tasks async.
>  * In MiniDispatcher, begin to shutting down per-job cluster.
>  * Stopping dispatcher and remove job.
>  * Job is cancelled and callback is executed in method startJobManagerRunner.
>  * Because job is removed before, so currentJobManagerRunner is null which 
> not equals to the original jobManagerRunner. In this case, 
> archivedExecutionGraph will not be uploaded.
> In normal cases, I find that job is cancelled first and then dispatcher is 
> stopped so that archivedExecutionGraph will succeed. But I think that the 
> order is not constrained and it is hard to know which comes first. 
> Above is what I suspected. If so, then we should fix it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-18959) Fail to archiveExecutionGraph because job is not finished when dispatcher close

Reply via email to