Re: Jobmanager restart after it has been requested to stop

Liting Liu (litiliu) via user Sun, 04 Feb 2024 18:17:12 -0800

Thank you, Yang:
      We have found the root cause.
      The Flink operator first calls Flink's REST API to stop the job, then 
calls the K8s API to delete the Flink JobManager deployment. However, it took 
more than one minute for K8s to delete that deployment. So after the REST call 
had shut down the JM's main container successfully, the restart policy 
restarted it, because the pod had not yet been deleted.
That's why we observed `Jobmanager restart after it has been requested to stop`.
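
A minimal sketch of that race, as a toy model rather than actual kubelet or 
operator code. The names `PodState` and `should_restart` are hypothetical, and 
the sketch assumes the JM pod runs with restartPolicy "Always" (as Deployment 
pods do) and that a container is not restarted once pod deletion has begun:

```python
from dataclasses import dataclass


@dataclass
class PodState:
    """Hypothetical, simplified view of the JobManager pod."""
    restart_policy: str       # "Always", "OnFailure", or "Never"
    deletion_requested: bool  # True once K8s has started deleting the pod


def should_restart(pod: PodState, exit_code: int) -> bool:
    """Toy restart decision for a container that has just exited."""
    if pod.deletion_requested:
        # A pod that is being deleted does not get its containers restarted.
        return False
    if pod.restart_policy == "Always":
        # Restart even on a clean exit (exit code 0).
        return True
    if pod.restart_policy == "OnFailure":
        return exit_code != 0
    return False


# The JM exited with code 0 while the pod deletion was still pending.
print(should_restart(PodState("Always", deletion_requested=False), exit_code=0))
# Same exit, but after K8s had already started deleting the pod.
print(should_restart(PodState("Always", deletion_requested=True), exit_code=0))
```

The first call models what we observed: a clean exit while the pod was not 
yet being deleted still leads to a restart.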
________________________________
From: Yang Wang <[email protected]>
Sent: February 2, 2024 17:56
To: Liting Liu (litiliu) <[email protected]>
Cc: user <[email protected]>
Subject: Re: Jobmanager restart after it has been requested to stop


If you can find "Deregistering Flink Kubernetes cluster, clusterId" in the 
JobManager log, then this is not the expected behavior.

Having the full logs of the JobManager Pod from before the restart would help 
a lot.
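
A small sketch of that check, assuming the JobManager log has been saved as 
text. The helper name `has_deregistration_message` is hypothetical. Whitespace 
is normalized first because logs pasted into email are often hard-wrapped, 
which can split the message across lines:

```python
DEREGISTER_MARKER = "Deregistering Flink Kubernetes cluster, clusterId"


def has_deregistration_message(log_text: str) -> bool:
    """Return True if the deregistration message appears in the log text."""
    # Collapse line breaks and repeated spaces so a wrapped message still matches.
    normalized = " ".join(log_text.split())
    return DEREGISTER_MARKER in normalized


wrapped = "INFO [] - Deregistering Flink Kubernetes\ncluster, clusterId my-job."
print(has_deregistration_message(wrapped))
```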



Best,
Yang

On Fri, Feb 2, 2024 at 1:26 PM Liting Liu (litiliu) via user 
<[email protected]> wrote:
Hi, community:
      I'm running a Flink 1.14.3 job with flink-kubernetes-operator-1.6.0 on 
AWS. I found that my Flink JobManager container's process restarted after this 
FlinkDeployment had been requested to stop. Here is the JobManager log:

2024-02-01 21:57:48,977 tn="flink-akka.actor.default-dispatcher-107478" INFO  
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap 
[] - Application CANCELED:
java.util.concurrent.CompletionException: 
org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: 
Application Status: CANCELED
      at 
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$6(ApplicationDispatcherBootstrap.java:353)
 ~[flink-dist_2.11-1.14.3.jar:1.14.3]
      at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) 
~[?:1.8.0_322]
2024-02-01 21:57:48,984 tn="flink-akka.actor.default-dispatcher-107484" INFO  
org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting 
down rest endpoint.
2024-02-01 21:57:49,103 tn="flink-akka.actor.default-dispatcher-107478" INFO  
org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent
 [] - Closing components.
2024-02-01 21:57:49,105 tn="flink-akka.actor.default-dispatcher-107484" INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Stopped 
dispatcher akka.tcp://flink@
2024-02-01 21:57:49,112 
tn="AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1" INFO  
org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Stopping Akka 
RPC service.
2024-02-01 21:57:49,286 tn="flink-metrics-15" INFO  
akka.remote.RemoteActorRefProvider$RemotingTerminator        [] - Remoting shut 
down.
2024-02-01 21:57:49,387 tn="main" INFO  
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Terminating 
cluster entrypoint process KubernetesApplicationClusterEntrypoint with exit 
code 0.
2024-02-01 21:57:53,828 tn="main" INFO  
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -     
-Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties
2024-02-01 21:57:54,287 tn="main" INFO  
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Starting 
KubernetesApplicationClusterEntrypoint.


I found that the JM main container's containerId remained the same after the 
JM auto-restarted.
Why did this process start running again after it had been requested to stop?
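
For anyone checking their own logs for the same pattern, here is a rough 
sketch that pairs each "Terminating cluster entrypoint process" line with the 
next "Starting KubernetesApplicationClusterEntrypoint" line and reports the 
gap. The function name `find_restarts` is hypothetical, and the sketch assumes 
one log record per line (unwrapped) in the timestamp format shown above:

```python
import re
from datetime import datetime

# Matches the leading timestamp, e.g. "2024-02-01 21:57:49,387".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_restarts(lines):
    """Return (exit_time, start_time, gap_seconds) for each exit-then-start pair."""
    restarts = []
    last_exit = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # continuation line or stack trace frame
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if "Terminating cluster entrypoint process" in line:
            last_exit = ts
        elif ("Starting KubernetesApplicationClusterEntrypoint" in line
              and last_exit is not None):
            restarts.append((last_exit, ts, (ts - last_exit).total_seconds()))
            last_exit = None
    return restarts


# Two records from the log above, joined back onto single lines.
sample = [
    '2024-02-01 21:57:49,387 tn="main" INFO  '
    'org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - '
    'Terminating cluster entrypoint process '
    'KubernetesApplicationClusterEntrypoint with exit code 0.',
    '2024-02-01 21:57:54,287 tn="main" INFO  '
    'org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - '
    'Starting KubernetesApplicationClusterEntrypoint.',
]

for exit_time, start_time, gap in find_restarts(sample):
    print(f"exited {exit_time}, restarted {start_time}, gap {gap:.1f}s")
```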
