Hi,
I recently ran into some strange behavior of Flink on YARN. When I try to
cancel a Flink job running in per-job mode on YARN with a command like
    cancel -m yarn-cluster -yid application_1559388106022_9412 ed7e2e0ab0a7316c1b65df6047bc6aae
the client finds and connects to the ResourceManager, and then gets stuck
at
    Found Web Interface 172.28.28.3:50099 of application 'application_1559388106022_9412'.
After one minute, the client throws an exception:
Caused by: org.apache.flink.util.FlinkException: Could not cancel job ed7e2e0ab0a7316c1b65df6047bc6aae.
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
    at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
    at org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
    at org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
    ... 20 more
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
    ... 27 more
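The second "Caused by" suggests that the client was blocked in a timed
CompletableFuture.get() on a future that never completed. Below is a
minimal Java sketch of that failure mode. It is not Flink's actual code:
the class CancelTimeoutSketch and the helper waitForAck are hypothetical
names used only for illustration, and the timeout is shortened to 200 ms.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CancelTimeoutSketch {

    // Hypothetical helper: block until the acknowledgement future completes
    // or the timeout elapses, and report which of the two happened.
    static String waitForAck(CompletableFuture<String> ack, long timeoutMillis)
            throws InterruptedException {
        try {
            return ack.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Nobody completed the future in time, e.g. because the
            // remote side shut down without sending a response.
            return "TIMEOUT";
        } catch (ExecutionException e) {
            // The future was completed exceptionally by the other side.
            return "FAILED: " + e.getCause();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // A future that is never completed stands in for a response
        // that never arrives.
        CompletableFuture<String> neverAnswered = new CompletableFuture<>();
        System.out.println(waitForAck(neverAnswered, 200));

        // A future that is already completed stands in for a normal reply.
        CompletableFuture<String> answered = CompletableFuture.completedFuture("ACK");
        System.out.println(waitForAck(answered, 200));
    }
}
```

If this reading is right, the timeout on the client only says that no
response arrived in time. It does not say whether the cancellation itself
took effect on the cluster.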
Then I discovered that the YARN app had already terminated with state
FINISHED and final status KILLED, as shown below.
[image: image.png]
After digging into the logs of this finished YARN app, I found that the
TaskManager had already received the SIGTERM signal and terminated
gracefully:
    org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
The JobManager log also shows that it terminated with exit code 0:
Terminating cluster entrypoint process YarnJobClusterEntrypoint with exit code 0
However, the JobManager did not return anything to the client before
shutting down, which is different from previous versions (like Flink 1.9).
Could this be a new bug in the flink-clients or flink-yarn module?
Thank you : )
Sincerely,
Weike