Re: Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.

2022-05-16 Thread Yang Wang
It will help a lot if you could share the logs of JobManager and
TaskManager for the unexpected `SUSPENDED` job.

Best,
Yang


Xiaolong Wang  于2022年5月16日周一 13:30写道:

> Sorry for the late reply.
>
> I checked the logs in both jobmanager & taskmanager.
>
> During that time, there were no more logs there.
>
> How can I reproduce the issue ?
>
> On Thu, May 12, 2022 at 10:35 AM Yang Wang  wrote:
>
>> The SUSPENDED state is usually caused by lost leadership. Maybe you could
>> find more information about leader in the JobManager and TaskManager logs.
>>
>> Best,
>> Yang
>>
>> Xiaolong Wang  于2022年5月11日周三 19:18写道:
>>
>>> Hello,
>>>
>>> Recently our Flink jobs on Native K8s encountered failing in the
>>> `SUSPENDED` status and got restarted for no reason.
>>>
>>> Flink version: 1.13.2
>>>
>>> Logs:
>>> ```
>>> 2022-05-11 05:01:41
>>>
>>> 2022-05-10 21:01:41,771 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
>>> checkpoint 17921 (type=CHECKPOINT) @ 1652216501302 for job
>>> .\n
>>> 2022-05-11 05:01:43
>>>
>>> 2022-05-10 21:01:42,860 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
>>> checkpoint 17921 for job  (11840 bytes in
>>> 866 ms).\n
>>> 2022-05-11 05:04:34
>>>
>>> 2022-05-10 21:04:34,550 INFO
>>> org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating a
>>> new watch on TaskManager pods.\n
>>> 2022-05-11 05:06:43
>>>
>>> 2022-05-10 21:06:43,512 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
>>> checkpoint 17922 (type=CHECKPOINT) @ 1652216802860 for job
>>> .\n
>>> 2022-05-11 05:06:44
>>>
>>> 2022-05-10 21:06:44,441 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
>>> checkpoint 17922 for job  (11840 bytes in
>>> 977 ms).\n
>>> 2022-05-11 05:11:45
>>>
>>> 2022-05-10 21:11:44,826 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
>>> checkpoint 17923 (type=CHECKPOINT) @ 1652217104441 for job
>>> .\n
>>> 2022-05-11 05:11:45
>>>
>>> 2022-05-10 21:11:45,537 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
>>> checkpoint 17923 for job  (11840 bytes in
>>> 646 ms).\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,746 INFO
>>> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
>>> [] - Stopping SessionDispatcherLeaderProcess.\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,747 INFO
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping
>>> dispatcher akka.tcp://flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,747 INFO
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all
>>> currently running jobs of dispatcher akka.tcp://
>>> flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,749 INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster
>>> for job
>>> insert-into_default_catalog.default_database.sn_fstore_location_cluster_raw_scylla_sink().\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,752 INFO
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job
>>>  reached terminal state SUSPENDED.\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,752 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
>>> insert-xxx_sink () switched from state
>>> RUNNING to SUSPENDED.\n
>>> 2022-05-11 05:12:36
>>>
>>> org.apache.flink.util.FlinkException: Scheduler is being stopped.\n
>>> 2022-05-11 05:12:36
>>>
>>> at
>>> org.apache.flink.runtime.scheduler.SchedulerBase.closeAsync(SchedulerBase.java:607)
>>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
>>> 2022-05-11 05:12:36
>>>
>>> at
>>> org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962)
>>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
>>> 2022-05-11 05:12:36
>>>
>>> at
>>> org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926)
>>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
>>> 2022-05-11 05:12:36
>>>
>>> at
>>> org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398)
>>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
>>> 2022-05-11 05:12:36
>>>
>>> at
>>> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
>>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
>>> 2022-05-11 05:12:36
>>>
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
>>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
>>> 2022-05-11 05:12:36
>>>
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
>>> ~[flink-dist_

Re: Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.

2022-05-11 Thread Yang Wang
The SUSPENDED state is usually caused by lost leadership. Maybe you could
find more information about leader in the JobManager and TaskManager logs.

Best,
Yang

Xiaolong Wang  于2022年5月11日周三 19:18写道:

> Hello,
>
> Recently our Flink jobs on Native K8s encountered failing in the
> `SUSPENDED` status and got restarted for no reason.
>
> Flink version: 1.13.2
>
> Logs:
> ```
> 2022-05-11 05:01:41
>
> 2022-05-10 21:01:41,771 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 17921 (type=CHECKPOINT) @ 1652216501302 for job
> .\n
> 2022-05-11 05:01:43
>
> 2022-05-10 21:01:42,860 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 17921 for job  (11840 bytes in
> 866 ms).\n
> 2022-05-11 05:04:34
>
> 2022-05-10 21:04:34,550 INFO
> org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating a
> new watch on TaskManager pods.\n
> 2022-05-11 05:06:43
>
> 2022-05-10 21:06:43,512 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 17922 (type=CHECKPOINT) @ 1652216802860 for job
> .\n
> 2022-05-11 05:06:44
>
> 2022-05-10 21:06:44,441 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 17922 for job  (11840 bytes in
> 977 ms).\n
> 2022-05-11 05:11:45
>
> 2022-05-10 21:11:44,826 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 17923 (type=CHECKPOINT) @ 1652217104441 for job
> .\n
> 2022-05-11 05:11:45
>
> 2022-05-10 21:11:45,537 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 17923 for job  (11840 bytes in
> 646 ms).\n
> 2022-05-11 05:12:36
>
> 2022-05-10 21:12:36,746 INFO
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
> [] - Stopping SessionDispatcherLeaderProcess.\n
> 2022-05-11 05:12:36
>
> 2022-05-10 21:12:36,747 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping
> dispatcher akka.tcp://flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n
> 2022-05-11 05:12:36
>
> 2022-05-10 21:12:36,747 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all
> currently running jobs of dispatcher akka.tcp://
> flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n
> 2022-05-11 05:12:36
>
> 2022-05-10 21:12:36,749 INFO org.apache.flink.runtime.jobmaster.JobMaster
> [] - Stopping the JobMaster for job
> insert-into_default_catalog.default_database.sn_fstore_location_cluster_raw_scylla_sink().\n
> 2022-05-11 05:12:36
>
> 2022-05-10 21:12:36,752 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job
>  reached terminal state SUSPENDED.\n
> 2022-05-11 05:12:36
>
> 2022-05-10 21:12:36,752 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> insert-xxx_sink () switched from state
> RUNNING to SUSPENDED.\n
> 2022-05-11 05:12:36
>
> org.apache.flink.util.FlinkException: Scheduler is being stopped.\n
> 2022-05-11 05:12:36
>
> at
> org.apache.flink.runtime.scheduler.SchedulerBase.closeAsync(SchedulerBase.java:607)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at
> org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at
> org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> [flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> [flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> [flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> [flink-dist_2.11-1.13.2.jar:1.13.2]\n
> 2022-05-11 05:12:36
>
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> [fli

Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.

2022-05-11 Thread Xiaolong Wang
Hello,

Recently our Flink jobs on Native K8s encountered failing in the
`SUSPENDED` status and got restarted for no reason.

Flink version: 1.13.2

Logs:
```
2022-05-11 05:01:41

2022-05-10 21:01:41,771 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 17921 (type=CHECKPOINT) @ 1652216501302 for job
.\n
2022-05-11 05:01:43

2022-05-10 21:01:42,860 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 17921 for job  (11840 bytes in
866 ms).\n
2022-05-11 05:04:34

2022-05-10 21:04:34,550 INFO
org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating a
new watch on TaskManager pods.\n
2022-05-11 05:06:43

2022-05-10 21:06:43,512 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 17922 (type=CHECKPOINT) @ 1652216802860 for job
.\n
2022-05-11 05:06:44

2022-05-10 21:06:44,441 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 17922 for job  (11840 bytes in
977 ms).\n
2022-05-11 05:11:45

2022-05-10 21:11:44,826 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 17923 (type=CHECKPOINT) @ 1652217104441 for job
.\n
2022-05-11 05:11:45

2022-05-10 21:11:45,537 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 17923 for job  (11840 bytes in
646 ms).\n
2022-05-11 05:12:36

2022-05-10 21:12:36,746 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
[] - Stopping SessionDispatcherLeaderProcess.\n
2022-05-11 05:12:36

2022-05-10 21:12:36,747 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping
dispatcher akka.tcp://flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n
2022-05-11 05:12:36

2022-05-10 21:12:36,747 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all
currently running jobs of dispatcher akka.tcp://
flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n
2022-05-11 05:12:36

2022-05-10 21:12:36,749 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Stopping the JobMaster for job
insert-into_default_catalog.default_database.sn_fstore_location_cluster_raw_scylla_sink().\n
2022-05-11 05:12:36

2022-05-10 21:12:36,752 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job
 reached terminal state SUSPENDED.\n
2022-05-11 05:12:36

2022-05-10 21:12:36,752 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
insert-xxx_sink () switched from state
RUNNING to SUSPENDED.\n
2022-05-11 05:12:36

org.apache.flink.util.FlinkException: Scheduler is being stopped.\n
2022-05-11 05:12:36

at
org.apache.flink.runtime.scheduler.SchedulerBase.closeAsync(SchedulerBase.java:607)
~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at
org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962)
~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at
org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926)
~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398)
~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]\n
2022-05-11 05:12:36

at akka.actor.ActorCell