[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804840#comment-17804840 ]
Zhenqiu Huang commented on FLINK-34007:
---------------------------------------

From initial investigation, the job manager first loses leadership and then goes to SUSPENDED status. Shouldn't the job manager exit directly rather than go to SUSPENDED status?

2024-01-08 21:44:57,142 INFO org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner [] - JobMasterServiceLeadershipRunner for job 217cee964b2cfdc3115fb74cac0ec550 was revoked leadership with leader id 9987190b-35f4-4238-b317-057dc3615e4d. Stopping current JobMasterServiceProcess.
2024-01-08 21:45:16,280 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - http://172.16.197.136:8081 lost leadership
2024-01-08 21:45:16,280 INFO org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Resource manager service is revoked leadership with session id 9987190b-35f4-4238-b317-057dc3615e4d.
2024-01-08 21:45:16,281 INFO org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - DefaultDispatcherRunner was revoked the leadership with leader id 9987190b-35f4-4238-b317-057dc3615e4d. Stopping the DispatcherLeaderProcess.
2024-01-08 21:45:16,282 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
2024-01-08 21:45:16,282 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping dispatcher pekko.tcp://flink@172.16.197.136:6123/user/rpc/dispatcher_1.
2024-01-08 21:45:16,282 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all currently running jobs of dispatcher pekko.tcp://flink@172.16.197.136:6123/user/rpc/dispatcher_1.
2024-01-08 21:45:16,282 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job 'amp-ade-fitness-clickstream-projection-uat' (217cee964b2cfdc3115fb74cac0ec550).
2024-01-08 21:45:16,285 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 217cee964b2cfdc3115fb74cac0ec550 reached terminal state SUSPENDED.
2024-01-08 21:45:16,286 INFO org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Stopping credential renewal
2024-01-08 21:45:16,286 INFO org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Stopped credential renewal
2024-01-08 21:45:16,286 INFO org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Closing the slot manager.
2024-01-08 21:45:16,286 INFO org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Suspending the slot manager.
2024-01-08 21:45:16,287 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2024-01-08 21:45:16,287 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Stopping KubernetesLeaderRetrievalDriver{configMapName='acsflink-5e92d541f0cd0ad7352c4dc5463c54df-cluster-config-map'}.
2024-01-08 21:45:16,287 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer [] - Stopped to watch for amp-ae-video-uat/acsflink-5e92d541f0cd0ad7352c4dc5463c54df-cluster-config-map, watching id:cc34317a-3299-4cb5-a966-55cb546e8bf9
2024-01-08 21:45:16,287 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job amp-ade-fitness-clickstream-projection-uat (217cee964b2cfdc3115fb74cac0ec550) switched from state RUNNING to SUSPENDED.

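
To make the ordering of the shutdown easier to see, here is a minimal sketch that turns log lines like the ones above into a timeline relative to the first leadership event. It assumes the log4j timestamp layout shown in this comment; the function name `leadership_timeline`, the keyword list, and the choice of sample lines are illustrative and not part of Flink.

```python
from datetime import datetime

# Timestamp layout used in the JobManager log quoted in this comment.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Hypothetical keyword filter: substrings that mark leadership loss or suspension.
KEYWORDS = ("revoked", "lost leadership", "SUSPENDED")

# A few lines copied verbatim from the log in this comment.
SAMPLE_LINES = [
    "2024-01-08 21:44:57,142 INFO org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner [] - "
    "JobMasterServiceLeadershipRunner for job 217cee964b2cfdc3115fb74cac0ec550 was revoked leadership "
    "with leader id 9987190b-35f4-4238-b317-057dc3615e4d. Stopping current JobMasterServiceProcess.",
    "2024-01-08 21:45:16,280 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - "
    "http://172.16.197.136:8081 lost leadership",
    "2024-01-08 21:45:16,282 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - "
    "Stopping SessionDispatcherLeaderProcess.",
    "2024-01-08 21:45:16,285 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - "
    "Job 217cee964b2cfdc3115fb74cac0ec550 reached terminal state SUSPENDED.",
]


def leadership_timeline(lines):
    """Return (seconds_since_first_event, logger_short_name) for matching lines."""
    events = []
    for line in lines:
        if not any(keyword in line for keyword in KEYWORDS):
            continue
        # The first 23 characters hold the timestamp, e.g. "2024-01-08 21:44:57,142".
        timestamp = datetime.strptime(line[:23], TS_FORMAT)
        # Fields are: date, time, level, logger class, ...
        logger = line.split()[3].rsplit(".", 1)[-1]
        events.append((timestamp, logger))
    if not events:
        return []
    start = events[0][0]
    return [((ts - start).total_seconds(), logger) for ts, logger in events]


if __name__ == "__main__":
    for offset, logger in leadership_timeline(SAMPLE_LINES):
        print(f"+{offset:7.3f}s  {logger}")
```

On the sample lines, the REST endpoint reports losing leadership about 19.1 seconds after the JobMasterServiceLeadershipRunner is revoked, and the job reaches SUSPENDED a few milliseconds after that.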
> Flink Job stuck in suspend state after recovery from failure in HA Mode
> -----------------------------------------------------------------------
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.18.1, 1.18.2
> Reporter: Zhenqiu Huang
> Priority: Major
>
> The observation is that Job manager goes to suspend state with a failed container not able to register itself to resource manager after timeout.
>
> JM Log:
> 2024-01-04 02:58:39,210 INFO org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner [] - JobMasterServiceLeadershipRunner for job 217cee964b2cfdc3115fb74cac0ec550 was revoked leadership with leader id eda6fee6-ce02-4076-9a99-8c43a92629f7. Stopping current JobMasterServiceProcess.
> 2024-01-04 02:58:58,347 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - http://172.16.71.11:8081 lost leadership
> 2024-01-04 02:58:58,347 INFO org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Resource manager service is revoked leadership with session id eda6fee6-ce02-4076-9a99-8c43a92629f7.
> 2024-01-04 02:58:58,348 INFO org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - DefaultDispatcherRunner was revoked the leadership with leader id eda6fee6-ce02-4076-9a99-8c43a92629f7. Stopping the DispatcherLeaderProcess.
> 2024-01-04 02:58:58,348 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
> 2024-01-04 02:58:58,349 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping dispatcher pekko.tcp://flink@172.16.71.11:6123/user/rpc/dispatcher_1.
> 2024-01-04 02:58:58,349 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job 'amp-ade-fitness-clickstream-projection-uat' (217cee964b2cfdc3115fb74cac0ec550).
> 2024-01-04 02:58:58,349 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all currently running jobs of dispatcher pekko.tcp://flink@172.16.71.11:6123/user/rpc/dispatcher_1.
> 2024-01-04 02:58:58,351 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 217cee964b2cfdc3115fb74cac0ec550 reached terminal state SUSPENDED.
> 2024-01-04 02:58:58,352 INFO org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Stopping credential renewal
> 2024-01-04 02:58:58,352 INFO org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Stopped credential renewal
> 2024-01-04 02:58:58,352 INFO org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Closing the slot manager.
> 2024-01-04 02:58:58,351 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job amp-ade-fitness-clickstream-projection-uat (217cee964b2cfdc3115fb74cac0ec550) switched from state RUNNING to SUSPENDED.
> org.apache.flink.util.FlinkException: AdaptiveScheduler is being stopped.
>     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.closeAsync(AdaptiveScheduler.java:474) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:1093) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:1056) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:454) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:239) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.lambda$terminate$0(PekkoRpcActor.java:574) ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.terminate(PekkoRpcActor.java:573) ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleControlMessage(PekkoRpcActor.java:196) ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>     at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>     at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>     at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>     at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
>
> TM Error Log:
> 2024-01-04 11:23:01,334 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Fatal error occurred in TaskExecutor pekko.tcp://flink@172.16.182.165:6122/user/rpc/taskmanager_0.
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration PT5M. This indicates a p
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1558) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$18(TaskExecutor.java:1543) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:451) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:218) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
>     at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
>     at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
>     at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
>     at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]

--
This message was sent by Atlassian Jira (v8.20.10#820010)