[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804840#comment-17804840
 ] 

Zhenqiu Huang commented on FLINK-34007:
---------------------------------------

From initial investigation, the job manager first loses leadership and then 
goes to SUSPENDED status. Shouldn't the job manager exit directly rather than 
go to SUSPENDED status?

2024-01-08 21:44:57,142 INFO  
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner [] - 
JobMasterServiceLeadershipRunner for job 217cee964b2cfdc3115fb74cac0ec550 was 
revoked leadership with leader id 9987190b-35f4-4238-b317-057dc3615e4d. 
Stopping current JobMasterServiceProcess.
2024-01-08 21:45:16,280 INFO  
org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - 
http://172.16.197.136:8081 lost leadership
2024-01-08 21:45:16,280 INFO  
org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - 
Resource manager service is revoked leadership with session id 
9987190b-35f4-4238-b317-057dc3615e4d.
2024-01-08 21:45:16,281 INFO  
org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - 
DefaultDispatcherRunner was revoked the leadership with leader id 
9987190b-35f4-4238-b317-057dc3615e4d. Stopping the DispatcherLeaderProcess.
2024-01-08 21:45:16,282 INFO  
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - 
Stopping SessionDispatcherLeaderProcess.
2024-01-08 21:45:16,282 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Stopping 
dispatcher pekko.tcp://flink@172.16.197.136:6123/user/rpc/dispatcher_1.
2024-01-08 21:45:16,282 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Stopping all 
currently running jobs of dispatcher 
pekko.tcp://flink@172.16.197.136:6123/user/rpc/dispatcher_1.
2024-01-08 21:45:16,282 INFO  org.apache.flink.runtime.jobmaster.JobMaster      
           [] - Stopping the JobMaster for job 
'amp-ade-fitness-clickstream-projection-uat' (217cee964b2cfdc3115fb74cac0ec550).
2024-01-08 21:45:16,285 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 
217cee964b2cfdc3115fb74cac0ec550 reached terminal state SUSPENDED.
2024-01-08 21:45:16,286 INFO  
org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - 
Stopping credential renewal
2024-01-08 21:45:16,286 INFO  
org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - 
Stopped credential renewal
2024-01-08 21:45:16,286 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] 
- Closing the slot manager.
2024-01-08 21:45:16,286 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] 
- Suspending the slot manager.
2024-01-08 21:45:16,287 INFO  
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
Stopping DefaultLeaderRetrievalService.
2024-01-08 21:45:16,287 INFO  
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] 
- Stopping 
KubernetesLeaderRetrievalDriver{configMapName='acsflink-5e92d541f0cd0ad7352c4dc5463c54df-cluster-config-map'}.
2024-01-08 21:45:16,287 INFO  
org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer
 [] - Stopped to watch for 
amp-ae-video-uat/acsflink-5e92d541f0cd0ad7352c4dc5463c54df-cluster-config-map, 
watching id:cc34317a-3299-4cb5-a966-55cb546e8bf9
2024-01-08 21:45:16,287 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
amp-ade-fitness-clickstream-projection-uat (217cee964b2cfdc3115fb74cac0ec550) 
switched from state RUNNING to SUSPENDED.
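
The timing in the log above can be checked with a small script. This is only a 
rough sketch: the timestamps are copied from the excerpt, the event labels are 
shortened by hand, and the helper names (parse_log_time, offsets_from_first) 
are made up for illustration and are not part of Flink. It shows that the 
JobMasterServiceLeadershipRunner was revoked about 19 seconds before the other 
components reported losing leadership and the job reached SUSPENDED.

```python
from datetime import datetime

# Timestamps copied from the JobManager log excerpt above; labels are
# hand-shortened summaries of the corresponding log messages.
EVENTS = [
    ("2024-01-08 21:44:57,142", "JobMasterServiceLeadershipRunner revoked"),
    ("2024-01-08 21:45:16,280", "ResourceManagerServiceImpl revoked"),
    ("2024-01-08 21:45:16,281", "DefaultDispatcherRunner revoked"),
    ("2024-01-08 21:45:16,285", "Job reached terminal state SUSPENDED"),
]

# Timestamp layout used in the log excerpt: date, time, comma, milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(text):
    """Parse a log timestamp such as '2024-01-08 21:44:57,142'."""
    return datetime.strptime(text, LOG_TIME_FORMAT)


def offsets_from_first(events):
    """Return (seconds since the first event, label) for every event."""
    start = parse_log_time(events[0][0])
    return [
        ((parse_log_time(ts) - start).total_seconds(), label)
        for ts, label in events
    ]


if __name__ == "__main__":
    for seconds, label in offsets_from_first(EVENTS):
        print(f"+{seconds:7.3f}s  {label}")
```

The same helper could be pointed at the timestamps in the original report 
below to compare the two occurrences.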

> Flink Job stuck in suspend state after recovery from failure in HA Mode
> -----------------------------------------------------------------------
>
>                 Key: FLINK-34007
>                 URL: https://issues.apache.org/jira/browse/FLINK-34007
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.1, 1.18.2
>            Reporter: Zhenqiu Huang
>            Priority: Major
>
> The observation is that the job manager goes to the SUSPENDED state, with a 
> failed container that cannot register itself with the resource manager 
> before the registration timeout.
> JM Log:
> 2024-01-04 02:58:39,210 INFO  
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner [] - 
> JobMasterServiceLeadershipRunner for job 217cee964b2cfdc3115fb74cac0ec550 was 
> revoked leadership with leader id eda6fee6-ce02-4076-9a99-8c43a92629f7. 
> Stopping current JobMasterServiceProcess.
> 2024-01-04 02:58:58,347 INFO  
> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - 
> http://172.16.71.11:8081 lost leadership
> 2024-01-04 02:58:58,347 INFO  
> org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - 
> Resource manager service is revoked leadership with session id 
> eda6fee6-ce02-4076-9a99-8c43a92629f7.
> 2024-01-04 02:58:58,348 INFO  
> org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - 
> DefaultDispatcherRunner was revoked the leadership with leader id 
> eda6fee6-ce02-4076-9a99-8c43a92629f7. Stopping the DispatcherLeaderProcess.
> 2024-01-04 02:58:58,348 INFO  
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] 
> - Stopping SessionDispatcherLeaderProcess.
> 2024-01-04 02:58:58,349 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Stopping 
> dispatcher pekko.tcp://flink@172.16.71.11:6123/user/rpc/dispatcher_1.
> 2024-01-04 02:58:58,349 INFO  org.apache.flink.runtime.jobmaster.JobMaster    
>              [] - Stopping the JobMaster for job 
> 'amp-ade-fitness-clickstream-projection-uat' 
> (217cee964b2cfdc3115fb74cac0ec550).
> 2024-01-04 02:58:58,349 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Stopping 
> all currently running jobs of dispatcher 
> pekko.tcp://flink@172.16.71.11:6123/user/rpc/dispatcher_1.
> 2024-01-04 02:58:58,351 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 
> 217cee964b2cfdc3115fb74cac0ec550 reached terminal state SUSPENDED.
> 2024-01-04 02:58:58,352 INFO  
> org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - 
> Stopping credential renewal
> 2024-01-04 02:58:58,352 INFO  
> org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - 
> Stopped credential renewal
> 2024-01-04 02:58:58,352 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager 
> [] - Closing the slot manager.
> 2024-01-04 02:58:58,351 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
> amp-ade-fitness-clickstream-projection-uat (217cee964b2cfdc3115fb74cac0ec550) 
> switched from state RUNNING to SUSPENDED.
> org.apache.flink.util.FlinkException: AdaptiveScheduler is being stopped.
>       at 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.closeAsync(AdaptiveScheduler.java:474)
>  ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:1093)
>  ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:1056)
>  ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:454) 
> ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:239)
>  ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.lambda$terminate$0(PekkoRpcActor.java:574)
>  ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
>  ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.terminate(PekkoRpcActor.java:573)
>  ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleControlMessage(PekkoRpcActor.java:196)
>  ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>       at 
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) 
> [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>       at 
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) 
> [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>       at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) 
> [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>       at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) 
> [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
> TM Error Log: 
> 2024-01-04 11:23:01,334 ERROR 
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Fatal error 
> occurred in TaskExecutor 
> pekko.tcp://flink@172.16.182.165:6122/user/rpc/taskmanager_0.
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
>  Could not register at the ResourceManager within the specified maximum 
> registration duration PT5M. This indicates a p
>       at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1558)
>  ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$18(TaskExecutor.java:1543)
>  ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451)
>  ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>  ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:451)
>  ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:218)
>  ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168)
>  ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at 
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at 
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at 
> org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
>  [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at 
> org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253) 
> [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>       at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) 
> [?:?]
>       at 
> java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
>  [?:?]
>       at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
>       at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) 
> [?:?]
>       at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) 
> [?:?]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)