[jira] [Updated] (FLINK-16626) Prevent REST handler from being closed more than once
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-16626: --- Labels: pull-request-available (was: ) > Prevent REST handler from being closed more than once > - > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Assignee: Weike Dong >Priority: Blocker > Labels: pull-request-available > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > In Flink 1.10.0 release, job cancellation can be problematic, as users > frequently experience java.util.concurrent.TimeoutException at the client > side, because the REST endpoint closes pre-maturely before sending out the > response, this is because the jobCancellationHandler is incorrectly reused > and closed twice. > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}{quote} > {quote} > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - > Rest endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - > Error while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at
[jira] [Updated] (FLINK-16626) Prevent REST handler from being closed more than once
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weike Dong updated FLINK-16626: --- Description: In Flink 1.10.0 release, job cancellation can be problematic, as users frequently experience java.util.concurrent.TimeoutException at the client side, because the REST endpoint closes pre-maturely before sending out the response, this is because the jobCancellationHandler is incorrectly reused and closed twice. When executing the following command to stop a flink job with yarn per-job mode, the client keeps retrying untill timeout (1 minutes)and exit with failure. But the job stops successfully. Command : {noformat} flink cancel $jobId yid appId {noformat} The exception on the client side is : {quote}{quote} {quote} 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - Sending request of class class org.apache.flink.runtime.rest.messages.EmptyRequestBody to ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel ... 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - Sending request of class class org.apache.flink.runtime.rest.messages.EmptyRequestBody to ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - Shutting down rest endpoint. 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - Sending request of class class org.apache.flink.runtime.rest.messages.EmptyRequestBody to ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel 2020-03-17 12:33:14,077 DEBUG org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest endpoint shutdown complete. 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command. org.apache.flink.util.FlinkException: Could not cancel job cc61033484d4c0e7a27a8a2a36f4de7a. at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) Caused by: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) ... 9 more The program finished with the following exception: org.apache.flink.util.FlinkException: Could not cancel job cc61033484d4c0e7a27a8a2a36f4de7a. at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) Caused by: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) ... 9 more {quote} Actually, the job was cancelled. But the server also prints some exception: {quote}2020-03-17 12:25:13,754 ERROR [flink-akka.actor.default-dispatcher-17] org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:766) - Failed to submit a listener notification task. Event loop shut down? java.util.concurrent.RejectedExecutionException: event executor terminated at
[jira] [Updated] (FLINK-16626) Prevent REST handler from being closed more than once
[ https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weike Dong updated FLINK-16626: --- Summary: Prevent REST handler from being closed more than once (was: Exception encountered when cancelling a job in yarn per-job mode) > Prevent REST handler from being closed more than once > - > > Key: FLINK-16626 > URL: https://issues.apache.org/jira/browse/FLINK-16626 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.10.0 >Reporter: chaiyongqiang >Assignee: Weike Dong >Priority: Blocker > Fix For: 1.10.1, 1.11.0 > > Attachments: jobmanager.log, patch.diff > > > When executing the following command to stop a flink job with yarn per-job > mode, the client keeps retrying untill timeout (1 minutes)and exit with > failure. But the job stops successfully. > Command : > {noformat} > flink cancel $jobId yid appId > {noformat} > The exception on the client side is : > {quote}bq. > 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > ... > 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Shutting down rest endpoint. > 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - > Sending request of class class > org.apache.flink.runtime.rest.messages.EmptyRequestBody to > ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel > 2020-03-17 12:33:14,077 DEBUG > org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 > thread-local buffer(s) from thread: flink-rest-client-netty-thread-1 > 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest > endpoint shutdown complete. > 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error > while running the command. > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543) > ... 9 more > > The program finished with the following exception: > org.apache.flink.util.FlinkException: Could not cancel job > cc61033484d4c0e7a27a8a2a36f4de7a. > at > org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at >