Hi Robert,
Below is the job manager log after issuing the “flink stop” command
====================
2020-07-24 19:24:12,388 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 1 (type=CHECKPOINT) @ 1595618652138 for job
853c59916ac33dfbf46503b33289929e.
2020-07-24 19:24:13,914 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 1 for job 853c59916ac33dfbf46503b33289929e (7146 bytes in 1774 ms).
2020-07-24 19:27:59,299 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Triggering stop-with-savepoint for job
853c59916ac33dfbf46503b33289929e.
2020-07-24 19:27:59,655 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 2 (type=SYNC_SAVEPOINT) @ 1595618879302 for job
853c59916ac33dfbf46503b33289929e.
2020-07-24 19:28:00,962 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 2 for job 853c59916ac33dfbf46503b33289929e (7147 bytes in 1240 ms).
======================
It looks normal to me.
In the kubernetes deployment cluster, we set up a metric reporter, it has these
keys in the flink-config.yaml
# Metrics Reporter Configuration
metrics.reporters: wavefront
metrics.reporter.wavefront.interval: 60 SECONDS
metrics.reporter.wavefront.env: prod
metrics.reporter.wavefront.class: com.xxxxxxxxx.flink.monitor.WavefrontReporter
metrics.reporter.wavefront.host: xxxxxx
metrics.reporter.wavefront.token: xxxxxxxxxx
metrics.scope.tm: flink.taskmanager
Could this reporter interval interfere the job manager? I test the same job in
a standalone
Flink 1.11.0 without the reporter, Flink stop worked, and no hanging nor
timeout. Also the same reporter is used in 1.9.1 version where we didn’t have
issue on “flink stop”.
Thanks
Ivan
> On Jul 24, 2020, at 5:15 AM, Robert Metzger <[email protected]> wrote:
>
> Hi Ivan,
> thanks a lot for your message. Can you post the JobManager log here as well?
> It might contain additional information on the reason for the timeout.
>
> On Fri, Jul 24, 2020 at 4:03 AM Ivan Yang <[email protected]
> <mailto:[email protected]>> wrote:
> Hello everyone,
>
> We recently upgrade FLINK from 1.9.1 to 1.11.0. Found one strange behavior
> when we stop a job to a save point got following time out error.
> I checked Flink web console, the save point is created in s3 in 1 second.The
> job is fairly simple, so 1 second for savepoint generation is expected. We
> use kubernetes deployment. I clocked it, it’s about 60 seconds when it
> returns this error. So afterwards, the job is hanging (it still says running,
> but actually not doing anything). I need run another command to cancel it.
> Anyone has idea what’s going on here? BTW, “flink stop works” in 1.19.1 for
> us before
>
>
>
> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$ flink stop
> 88d9b46f59d131428e2a18c9c7b3aa3f
> Suspending job "88d9b46f59d131428e2a18c9c7b3aa3f" with a savepoint.
>
> ------------------------------------------------------------
> The program finished with the following exception:
>
> org.apache.flink.util.FlinkException: Could not stop with a savepoint job
> "88d9b46f59d131428e2a18c9c7b3aa3f".
> at
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
> at
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
> at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
> at
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
> at
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
> at
> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
> Caused by: java.util.concurrent.TimeoutException
> at
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
> at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> at
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
> ... 9 more
> flink@flink-jobmanager-fcf5d84c5-sz4wk:~$
>
>
> Thanks in advance,
> Ivan