Hi

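By "manual checkpoint" here, I assume you mean a savepoint. Judging from the stack trace, the command failed with a TimeoutException, which may be caused by the savepoint being slow.
You can check the JM (JobManager) log to see whether the savepoint took a long time to complete.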
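To check how long the client actually waited, you can compare the timestamps of two log lines. Below is a minimal sketch that assumes the default timestamp format seen in the quoted log (`yyyy-MM-dd HH:mm:ss,SSS` at the start of each line). The class and method names (`LogGap`, `elapsedMillis`) are hypothetical helpers, not part of Flink. Applied to the quoted CLI log, it shows the client gave up roughly 30 seconds after "Waiting for response...". The same helper could be applied to the JM's trigger/completion lines for the savepoint, but the exact wording of those lines may differ between Flink versions.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: measures the time between two log lines
// that start with a timestamp such as "2020-06-19 15:11:18,361".
class LogGap {
    // Assumed timestamp layout, taken from the quoted log (23 characters).
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Parse the timestamp prefix of a log line.
    static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), TS);
    }

    // Elapsed time between two log lines, in milliseconds.
    static long elapsedMillis(String startLine, String endLine) {
        return Duration.between(timestampOf(startLine), timestampOf(endLine)).toMillis();
    }

    public static void main(String[] args) {
        String waiting = "2020-06-19 15:11:18,378 INFO  org.apache.flink.client.cli.CliFrontend - Waiting for response...";
        String failed = "2020-06-19 15:11:48,385 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.";
        // Prints the client-side wait before the TimeoutException was reported.
        System.out.println(elapsedMillis(waiting, failed) + " ms");
    }
}
```

If the JM log shows the savepoint completing well after this window, the savepoint itself was simply slow and the client stopped waiting before it finished.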

Also, could you describe the main scenarios in which you use savepoints?
1. Why do you need savepoints?
2. In your scenario, could checkpoints be used instead of savepoints?

Best,
Congxian


On Fri, Jun 19, 2020 at 3:25 PM, Zhou Zach <wander...@163.com> wrote:

> 2020-06-19 15:11:18,361 INFO  org.apache.flink.client.cli.CliFrontend - Triggering savepoint for job e229c76e6a1b43142cb4272523102ed1.
> 2020-06-19 15:11:18,378 INFO  org.apache.flink.client.cli.CliFrontend - Waiting for response...
> 2020-06-19 15:11:48,381 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
> 2020-06-19 15:11:48,382 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
> 2020-06-19 15:11:48,385 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session: 0x172b776fac82479 closed
> 2020-06-19 15:11:48,385 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x172b776fac82479
> 2020-06-19 15:11:48,385 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
> org.apache.flink.util.FlinkException: Triggering a savepoint for the job e229c76e6a1b43142cb4272523102ed1 failed.
>         at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:633)
>         at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:611)
>         at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>         at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:608)
>         at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:910)
>         at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
>         at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>         at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
> Caused by: java.util.concurrent.TimeoutException
>         at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:999)
>         at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
>         at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$14(FutureUtils.java:427)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
