[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072429#comment-17072429 ] Weike Dong commented on FLINK-16626: Thanks [~liyu] [~tison] for watching this issue, and I will try my best to fix and test it thoroughly : ) > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Assignee: Weike Dong >Priority: Blocker > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.ja
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072325#comment-17072325 ] Yu Li commented on FLINK-16626: --- Thanks for the note [~tison], will watch this one. And thanks for standing up [~kyledong], no mean to push but hopefully we could get a thorough fix soon (smile). > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Assignee: Weike Dong >Priority: Blocker > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concur
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072315#comment-17072315 ] Zili Chen commented on FLINK-16626: --- [~kyledong] Thanks for your interest. I've assigned to you. [~liyu] I raise this issue as blocker for 1.10.1 since it broke user interface({{flink cancel}}) and it's hopeful we fix it in days. Comment if you have any concern. > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Assignee: Weike Dong >Priority: Blocker > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.Comple
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072311#comment-17072311 ] Weike Dong commented on FLINK-16626: Hi [~tison], I am quite interested in solving this issue and have been following this for a while, so I guess I can help fix this bug : ) > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:19
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070739#comment-17070739 ] Yang Wang commented on FLINK-16626: --- [~tison] The reason why {{InFlightRequestTracker#awaitAsync}} is called twice in your analysis is that we are using a same {{jobCancelTerminationHandler}} for two different handlers, {{JobCancelTerminationHandler}} and {{YarnCancelJobTerminationHeaders}}. > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableF
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070720#comment-17070720 ] Yang Wang commented on FLINK-16626: --- [~tison] Thanks a lot for your analysis. I think you are right. However, i have checked the YARN code, even in the yarn-3.x, the proxy still could not support {{PATCH}} request. And the hadoop community does not have a clear plan to fix it. Because of this, we still can not remove the {{YarnCancelJobTerminationHeaders}}, i suggest to create a new {{JobCancellationHandler}} object for yarn-cancel. It should fix this problem. >> Side output Currently, the {{YarnCancelJobTerminationHeaders}} is only used via webui cancelation. The {{flink cancel}} is using {{JobCancellationHeaders}} and will directly talk to jobmanager(not using the YARN application proxy). For other deployment(standalone, K8s), they are also using {{YarnCancelJobTerminationHeaders}} for the webui cancelation. It is maybe a small trick for the name. > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.ja
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069357#comment-17069357 ] tartarus commented on FLINK-16626: -- This is JM logs,but It's a surface phenomenon. {code:java} 2020-03-22 17:16:14,770 ERROR org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.rejectedExecution[flink-akka.actor.default-dispatcher-20] - Failed to submit a listener notification task. Event loop shut down? java.util.concurrent.RejectedExecutionException: event executor terminated at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:855) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:340) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:333) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:766) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:764) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:421) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:149) at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:95) at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:30) at org.apache.flink.runtime.rest.handler.util.HandlerUtils.sendResponse(HandlerUtils.java:224) at org.apache.flink.runtime.rest.handler.util.HandlerUtils.sendResponse(HandlerUtils.java:176) at org.apache.flink.runtime.rest.handler.util.HandlerUtils.sendResponse(HandlerUtils.java:91) at org.apache.flink.runtime.rest.handler.AbstractRestHandler.lambda$respondToRequest$0(AbstractRestHandler.java:78) at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:656) at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:632) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:874) at akka.dispatch.OnComplete.internal(Future.scala:264) at akka.dispatch.OnComplete.internal(Future.scala:261) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572) at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22) at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21) at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436) at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91) at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} > Exception encountered when cancelling a job in yarn per
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069232#comment-17069232 ] Zili Chen commented on FLINK-16626: --- link to the analysis https://issues.apache.org/jira/browse/FLINK-16637?focusedCommentId=17063164&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17063164 https://issues.apache.org/jira/browse/FLINK-16637?focusedCommentId=17069225&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17069225 https://issues.apache.org/jira/browse/FLINK-16637?focusedCommentId=17069226&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17069226 https://issues.apache.org/jira/browse/FLINK-16637?focusedCommentId=17069227&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17069227 and closed FLINK-16637 as duplication > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) >
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069221#comment-17069221 ] Zili Chen commented on FLINK-16626: --- I found that {{org.apache.flink.runtime.rest.handler.InFlightRequestTracker#awaitAsync()}} is called more than once which cause {{Phaser}} synchronization meaningless. Further debugging... > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:191
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069188#comment-17069188 ] Zili Chen commented on FLINK-16626: --- For coming developer: you can duplicate {{YARNITCase}}, insert a cancel command, and add log for locally debugging. > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFron
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069184#comment-17069184 ] Zili Chen commented on FLINK-16626: --- I've submitted a local YARN testing log, which indicates that "finalizeRequestProcessing" happens after cluster shutdown. Will investigate further later. > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > Attachments: jobmanager.log > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069163#comment-17069163 ] Zili Chen commented on FLINK-16626: --- Could you reproduce that issue and share DEBUG level JM log? > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > {quote} > Actually, the job was cancelled. But the server also pr
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064690#comment-17064690 ] tartarus commented on FLINK-16626: -- We encountered the same problem. So I solved it temporarily and did not wait for the result of the cancel and returned directly. {code:java} //代码占位符 return CompletableFuture.completedFuture(EmptyResponseBody.getInstance()); // return terminationFuture.handle( // (Acknowledge ack, Throwable throwable) -> { // if (throwable != null) { // Throwable error = ExceptionUtils.stripCompletionException(throwable); // if (error instanceof TimeoutException) { // throw new CompletionException( // new RestHandlerException( // "Job cancellation timed out.", // HttpResponseStatus.REQUEST_TIMEOUT, error)); // } else if (error instanceof FlinkJobNotFoundException) { // throw new CompletionException( // new RestHandlerException( // "Job could not be found.", // HttpResponseStatus.NOT_FOUND, error)); // } else { // throw new CompletionException( // new RestHandlerException( // "Job cancellation failed: " + error.getMessage(), // HttpResponseStatus.INTERNAL_SERVER_ERROR, error)); // } // } else { // return EmptyResponseBody.getInstance(); // } // }); {code} But this is equivalent to changed the semantics of cancel!!! Equivalent to just sending a signal to JM. Do you have any good suggestions? > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroup
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061326#comment-17061326 ] chaiyongqiang commented on FLINK-16626: --- Maybe you'ra right. As far as i see, *flink stop* is going to be removed .Only *flink cancel* comes into this logic. Also there could be some other situation i haven't noticed. > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliF
[jira] [Commented] (FLINK-16626) Exception encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060706#comment-17060706 ] Yang Wang commented on FLINK-16626: --- I think not only {{flink cancel}}, other operations may come across the similar problems. Since the dispatcher may exit before the Flink client has received the response. > Exception encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$
[jira] [Commented] (FLINK-16626) Exception Encountered when cancelling a job in yarn per-job mode
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060639#comment-17060639 ] chaiyongqiang commented on FLINK-16626: --- In my opinion, when the client call an cancel command, the server will run into the following logic : # call the JobCancellationHandler to handle the request # response to the client Codes as follows: {quote}@Override protected CompletableFuture respondToRequest(ChannelHandlerContext ctx, HttpRequest httpRequest, HandlerRequest handlerRequest, T gateway) { CompletableFuture response; try { response = handleRequest(handlerRequest, gateway); } catch (RestHandlerException e) { response = FutureUtils.completedExceptionally(e); } return response.thenAccept(resp -> HandlerUtils.sendResponse(ctx, httpRequest, resp, messageHeaders.getResponseStatusCode(), responseHeaders)); }{quote} But in the handleRequest function of `JobCancellationHandler`, the cluster will stop and call to shutdown the RestServerEndpoint in which the EventLoopGroup also will be shutdown. Then when calling `HandlerUtils.sendResponse` to response to the client, exception occurs. > Exception Encountered when cancelling a job in yarn per-job mode > > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Priority: Major > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.fli