Hi Robert,

It would be interesting to see the corresponding TaskManager/JobManager logs; they would help in finding out why the savepoint creation failed. Just to verify: the savepoint data wasn't written to S3 even after the timeout happened, was it?

Best,
Matthias

On Thu, May 27, 2021 at 7:50 PM Robert Cullen <cinquate...@gmail.com> wrote:

> I triggered a savepoint from a currently running job. Although the
> directory structure gets created in the MINIO S3 store, the command
> ultimately fails without writing the data.
>
> root@flink-client:/opt/flink# ./bin/flink list --target kubernetes-session -Dkubernetes.cluster-id=flink-jobmanager -Dkubernetes.namespace=cmdaa
> 2021-05-27 17:37:00,409 INFO org.apache.flink.kubernetes.KubernetesClusterDescriptor [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://flink-jobmanager-rest.cmdaa:8081
> Waiting for response...
> ------------------ Running/Restarting Jobs -------------------
> 27.05.2021 16:50:00 : 72f614340dc1a7416d0613362d1ef83b : Streaming Log Count (RUNNING)
> --------------------------------------------------------------
> No scheduled jobs.
> root@flink-client:/opt/flink# ./bin/flink savepoint 72f614340dc1a7416d0613362d1ef83b --target kubernetes-session -Dkubernetes.cluster-id=flink-jobmanager -Dkubernetes.namespace=cmdaa
> 2021-05-27 17:37:58,776 INFO org.apache.flink.kubernetes.KubernetesClusterDescriptor [] - Retrieve flink cluster flink-jobmanager successfully, JobManager Web Interface: http://flink-jobmanager-rest.cmdaa:8081
> Triggering savepoint for job 72f614340dc1a7416d0613362d1ef83b.
> Waiting for response...
>
> ------------------------------------------------------------
>  The program finished with the following exception:
>
> org.apache.flink.util.FlinkException: Triggering a savepoint for the job 72f614340dc1a7416d0613362d1ef83b failed.
>     at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:777)
>     at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:754)
>     at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
>     at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:751)
>     at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1072)
>     at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
>     at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
>     at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
> Caused by: java.util.concurrent.TimeoutException
>     at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
>     at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:771)
>     ... 7 more
> root@flink-client:/opt/flink#
>
> --
> Robert Cullen
> 240-475-4490