Re: Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.
It will help a lot if you could share the logs of JobManager and TaskManager for the unexpected `SUSPENDED` job. Best, Yang Xiaolong Wang 于2022年5月16日周一 13:30写道: > Sorry for the late reply. > > I checked the logs in both jobmanager & taskmanager. > > During that time, there were no more logs there. > > How can I reproduce the issue ? > > On Thu, May 12, 2022 at 10:35 AM Yang Wang wrote: > >> The SUSPENDED state is usually caused by lost leadership. Maybe you could >> find more information about leader in the JobManager and TaskManager logs. >> >> Best, >> Yang >> >> Xiaolong Wang 于2022年5月11日周三 19:18写道: >> >>> Hello, >>> >>> Recently our Flink jobs on Native K8s encountered failing in the >>> `SUSPENDED` status and got restarted for no reason. >>> >>> Flink version: 1.13.2 >>> >>> Logs: >>> ``` >>> 2022-05-11 05:01:41 >>> >>> 2022-05-10 21:01:41,771 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering >>> checkpoint 17921 (type=CHECKPOINT) @ 1652216501302 for job >>> .\n >>> 2022-05-11 05:01:43 >>> >>> 2022-05-10 21:01:42,860 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed >>> checkpoint 17921 for job (11840 bytes in >>> 866 ms).\n >>> 2022-05-11 05:04:34 >>> >>> 2022-05-10 21:04:34,550 INFO >>> org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating a >>> new watch on TaskManager pods.\n >>> 2022-05-11 05:06:43 >>> >>> 2022-05-10 21:06:43,512 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering >>> checkpoint 17922 (type=CHECKPOINT) @ 1652216802860 for job >>> .\n >>> 2022-05-11 05:06:44 >>> >>> 2022-05-10 21:06:44,441 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed >>> checkpoint 17922 for job (11840 bytes in >>> 977 ms).\n >>> 2022-05-11 05:11:45 >>> >>> 2022-05-10 21:11:44,826 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering >>> checkpoint 17923 (type=CHECKPOINT) @ 1652217104441 for job >>> .\n >>> 2022-05-11 05:11:45 >>> >>> 2022-05-10 21:11:45,537 INFO >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed >>> checkpoint 17923 for job (11840 bytes in >>> 646 ms).\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,746 INFO >>> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess >>> [] - Stopping SessionDispatcherLeaderProcess.\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,747 INFO >>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping >>> dispatcher akka.tcp://flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,747 INFO >>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all >>> currently running jobs of dispatcher akka.tcp:// >>> flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,749 INFO >>> org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster >>> for job >>> insert-into_default_catalog.default_database.sn_fstore_location_cluster_raw_scylla_sink().\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,752 INFO >>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job >>> reached terminal state SUSPENDED.\n >>> 2022-05-11 05:12:36 >>> >>> 2022-05-10 21:12:36,752 INFO >>> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job >>> insert-xxx_sink () switched from state >>> RUNNING to SUSPENDED.\n >>> 2022-05-11 05:12:36 >>> >>> org.apache.flink.util.FlinkException: Scheduler is being stopped.\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.scheduler.SchedulerBase.closeAsync(SchedulerBase.java:607) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563) >>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n >>> 2022-05-11 05:12:36 >>> >>> at >>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186) >>> ~[flink-dist_
Re: Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.
The SUSPENDED state is usually caused by lost leadership. Maybe you could find more information about leader in the JobManager and TaskManager logs. Best, Yang Xiaolong Wang 于2022年5月11日周三 19:18写道: > Hello, > > Recently our Flink jobs on Native K8s encountered failing in the > `SUSPENDED` status and got restarted for no reason. > > Flink version: 1.13.2 > > Logs: > ``` > 2022-05-11 05:01:41 > > 2022-05-10 21:01:41,771 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering > checkpoint 17921 (type=CHECKPOINT) @ 1652216501302 for job > .\n > 2022-05-11 05:01:43 > > 2022-05-10 21:01:42,860 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed > checkpoint 17921 for job (11840 bytes in > 866 ms).\n > 2022-05-11 05:04:34 > > 2022-05-10 21:04:34,550 INFO > org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating a > new watch on TaskManager pods.\n > 2022-05-11 05:06:43 > > 2022-05-10 21:06:43,512 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering > checkpoint 17922 (type=CHECKPOINT) @ 1652216802860 for job > .\n > 2022-05-11 05:06:44 > > 2022-05-10 21:06:44,441 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed > checkpoint 17922 for job (11840 bytes in > 977 ms).\n > 2022-05-11 05:11:45 > > 2022-05-10 21:11:44,826 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering > checkpoint 17923 (type=CHECKPOINT) @ 1652217104441 for job > .\n > 2022-05-11 05:11:45 > > 2022-05-10 21:11:45,537 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed > checkpoint 17923 for job (11840 bytes in > 646 ms).\n > 2022-05-11 05:12:36 > > 2022-05-10 21:12:36,746 INFO > org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess > [] - Stopping SessionDispatcherLeaderProcess.\n > 2022-05-11 05:12:36 > > 2022-05-10 21:12:36,747 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping > dispatcher akka.tcp://flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n > 2022-05-11 05:12:36 > > 2022-05-10 21:12:36,747 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all > currently running jobs of dispatcher akka.tcp:// > flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n > 2022-05-11 05:12:36 > > 2022-05-10 21:12:36,749 INFO org.apache.flink.runtime.jobmaster.JobMaster > [] - Stopping the JobMaster for job > insert-into_default_catalog.default_database.sn_fstore_location_cluster_raw_scylla_sink().\n > 2022-05-11 05:12:36 > > 2022-05-10 21:12:36,752 INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job > reached terminal state SUSPENDED.\n > 2022-05-11 05:12:36 > > 2022-05-10 21:12:36,752 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job > insert-xxx_sink () switched from state > RUNNING to SUSPENDED.\n > 2022-05-11 05:12:36 > > org.apache.flink.util.FlinkException: Scheduler is being stopped.\n > 2022-05-11 05:12:36 > > at > org.apache.flink.runtime.scheduler.SchedulerBase.closeAsync(SchedulerBase.java:607) > ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at > org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962) > ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at > org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926) > ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398) > ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at > org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214) > ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563) > ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186) > ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > [flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > [flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > [flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) > [flink-dist_2.11-1.13.2.jar:1.13.2]\n > 2022-05-11 05:12:36 > > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) > [fli
Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.
Hello, Recently our Flink jobs on Native K8s encountered failing in the `SUSPENDED` status and got restarted for no reason. Flink version: 1.13.2 Logs: ``` 2022-05-11 05:01:41 2022-05-10 21:01:41,771 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 17921 (type=CHECKPOINT) @ 1652216501302 for job .\n 2022-05-11 05:01:43 2022-05-10 21:01:42,860 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 17921 for job (11840 bytes in 866 ms).\n 2022-05-11 05:04:34 2022-05-10 21:04:34,550 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating a new watch on TaskManager pods.\n 2022-05-11 05:06:43 2022-05-10 21:06:43,512 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 17922 (type=CHECKPOINT) @ 1652216802860 for job .\n 2022-05-11 05:06:44 2022-05-10 21:06:44,441 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 17922 for job (11840 bytes in 977 ms).\n 2022-05-11 05:11:45 2022-05-10 21:11:44,826 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 17923 (type=CHECKPOINT) @ 1652217104441 for job .\n 2022-05-11 05:11:45 2022-05-10 21:11:45,537 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 17923 for job (11840 bytes in 646 ms).\n 2022-05-11 05:12:36 2022-05-10 21:12:36,746 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.\n 2022-05-11 05:12:36 2022-05-10 21:12:36,747 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping dispatcher akka.tcp://flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n 2022-05-11 05:12:36 2022-05-10 21:12:36,747 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all currently running jobs of dispatcher akka.tcp:// flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n 2022-05-11 05:12:36 2022-05-10 21:12:36,749 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job insert-into_default_catalog.default_database.sn_fstore_location_cluster_raw_scylla_sink().\n 2022-05-11 05:12:36 2022-05-10 21:12:36,752 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job reached terminal state SUSPENDED.\n 2022-05-11 05:12:36 2022-05-10 21:12:36,752 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job insert-xxx_sink () switched from state RUNNING to SUSPENDED.\n 2022-05-11 05:12:36 org.apache.flink.util.FlinkException: Scheduler is being stopped.\n 2022-05-11 05:12:36 at org.apache.flink.runtime.scheduler.SchedulerBase.closeAsync(SchedulerBase.java:607) ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962) ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926) ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398) ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214) ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563) ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186) ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.13.2.jar:1.13.2]\n 2022-05-11 05:12:36 at akka.actor.ActorCell