[jira] [Updated] (FLINK-16626) Prevent REST handler from being closed more than once

2020-04-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-16626:
---
Labels: pull-request-available  (was: )

> Prevent REST handler from being closed more than once
> -
>
> Key: FLINK-16626
> URL: https://issues.apache.org/jira/browse/FLINK-16626
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.10.0
>Reporter: chaiyongqiang
>Assignee: Weike Dong
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.1, 1.11.0
>
> Attachments: jobmanager.log, patch.diff
>
>
> In the Flink 1.10.0 release, job cancellation is unreliable: users frequently 
> see a java.util.concurrent.TimeoutException on the client side because the 
> REST endpoint closes prematurely, before sending out the response. The cause 
> is that the jobCancellationHandler is incorrectly reused and closed twice.
>  
> When executing the following command to stop a Flink job in YARN per-job 
> mode, the client keeps retrying until it times out (after 1 minute) and exits 
> with a failure, even though the job stops successfully.
>  Command:
> {noformat}
> flink cancel $jobId yid appId
> {noformat}
>  The exception on the client side is:
> {quote}
> 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - 
> Sending request of class class 
> org.apache.flink.runtime.rest.messages.EmptyRequestBody to 
> ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
>  ...
>  2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - 
> Sending request of class class 
> org.apache.flink.runtime.rest.messages.EmptyRequestBody to 
> ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
>  2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - 
> Shutting down rest endpoint.
>  2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - 
> Sending request of class class 
> org.apache.flink.runtime.rest.messages.EmptyRequestBody to 
> ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
>  2020-03-17 12:33:14,077 DEBUG 
> org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 
> thread-local buffer(s) from thread: flink-rest-client-netty-thread-1
>  2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - 
> Rest endpoint shutdown complete.
>  2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - 
> Error while running the command.
>  org.apache.flink.util.FlinkException: Could not cancel job 
> cc61033484d4c0e7a27a8a2a36f4de7a.
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
>  at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>  at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
>  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
>  Caused by: java.util.concurrent.TimeoutException
>  at 
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
>  ... 9 more
> 
>  The program finished with the following exception:
> org.apache.flink.util.FlinkException: Could not cancel job 
> cc61033484d4c0e7a27a8a2a36f4de7a.
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
>  at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>  at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
>  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at 

[jira] [Updated] (FLINK-16626) Prevent REST handler from being closed more than once

2020-04-05 Thread Weike Dong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weike Dong updated FLINK-16626:
---
Description: 
In the Flink 1.10.0 release, job cancellation is unreliable: users frequently 
see a java.util.concurrent.TimeoutException on the client side because the REST 
endpoint closes prematurely, before sending out the response. The cause is that 
the jobCancellationHandler is incorrectly reused and closed twice.
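
One common way to guard against a handler being closed twice is to make the close operation idempotent. The following is only a minimal sketch of that idea, not the fix from the attached patch.diff: the class CloseOnceHandler, its closeAsync() method and the shutdownRuns counter are hypothetical names chosen for illustration and are not Flink's actual classes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical handler (not a Flink class): closeAsync() may be called any number of times.
class CloseOnceHandler {
    // Holds the termination future installed by the first closeAsync() call.
    private final AtomicReference<CompletableFuture<Void>> terminationFuture =
            new AtomicReference<>();

    // Counts how often the real shutdown logic ran (for demonstration only).
    final AtomicInteger shutdownRuns = new AtomicInteger();

    CompletableFuture<Void> closeAsync() {
        CompletableFuture<Void> fresh = new CompletableFuture<>();
        if (terminationFuture.compareAndSet(null, fresh)) {
            // The first caller wins and performs the actual shutdown exactly once.
            shutdownRuns.incrementAndGet();
            fresh.complete(null);
            return fresh;
        }
        // Later callers receive the same future and trigger no further shutdown work.
        return terminationFuture.get();
    }
}

class CloseOnceDemo {
    public static void main(String[] args) {
        CloseOnceHandler handler = new CloseOnceHandler();
        CompletableFuture<Void> first = handler.closeAsync();
        CompletableFuture<Void> second = handler.closeAsync();
        System.out.println(first == second);            // same future for both calls
        System.out.println(handler.shutdownRuns.get()); // shutdown logic ran once
    }
}
```

With a guard of this kind, a handler that is registered under two routes and therefore closed twice would run its shutdown logic only once, and both callers would wait on the same termination future.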
 
When executing the following command to stop a Flink job in YARN per-job 
mode, the client keeps retrying until it times out (after 1 minute) and exits 
with a failure, even though the job stops successfully.
 Command:
{noformat}
flink cancel $jobId yid appId
{noformat}
 The exception on the client side is:
{quote}
2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - 
Sending request of class class 
org.apache.flink.runtime.rest.messages.EmptyRequestBody to 
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
 ...
 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - 
Sending request of class class 
org.apache.flink.runtime.rest.messages.EmptyRequestBody to 
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - 
Shutting down rest endpoint.
 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - 
Sending request of class class 
org.apache.flink.runtime.rest.messages.EmptyRequestBody to 
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
 2020-03-17 12:33:14,077 DEBUG 
org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 
thread-local buffer(s) from thread: flink-rest-client-netty-thread-1
 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest 
endpoint shutdown complete.
 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error 
while running the command.
 org.apache.flink.util.FlinkException: Could not cancel job 
cc61033484d4c0e7a27a8a2a36f4de7a.
 at 
org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
 at 
org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
 at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
 at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904)
 at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
 at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
 at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
 Caused by: java.util.concurrent.TimeoutException
 at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
 at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
 at 
org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
 ... 9 more


 The program finished with the following exception:

org.apache.flink.util.FlinkException: Could not cancel job 
cc61033484d4c0e7a27a8a2a36f4de7a.
 at 
org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
 at 
org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
 at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
 at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904)
 at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
 at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
 at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
 Caused by: java.util.concurrent.TimeoutException
 at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
 at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
 at 
org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
 ... 9 more
{quote}
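
The TimeoutException at the bottom of the trace above is thrown by CompletableFuture.get with a timeout, which is what happens when the awaited response never arrives. The sketch below reproduces only that mechanism with the JDK alone; the class CancelTimeoutSketch and the helper awaitAck are hypothetical and do not mirror CliFrontend's actual code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not Flink code): waits for an acknowledgement with a deadline.
class CancelTimeoutSketch {
    static String awaitAck(CompletableFuture<Void> ack, long timeoutMillis) {
        try {
            // Blocks until the future completes or the timeout elapses.
            ack.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "acknowledged";
        } catch (TimeoutException e) {
            // No response arrived in time, even if the server did the work.
            return "timed out";
        } catch (ExecutionException e) {
            return "failed";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        // A future that is never completed stands in for the lost REST response.
        CompletableFuture<Void> neverAnswered = new CompletableFuture<>();
        System.out.println(awaitAck(neverAnswered, 100));

        // An already completed future stands in for a response that did arrive.
        CompletableFuture<Void> answered = CompletableFuture.completedFuture(null);
        System.out.println(awaitAck(answered, 100));
    }
}
```

This illustrates why the client can report a failure while the job is cancelled on the server: the timeout only says that no acknowledgement was received, not that the cancellation itself failed.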
The job was in fact cancelled, but the server also logs an exception:

 
{quote}2020-03-17 12:25:13,754 ERROR [flink-akka.actor.default-dispatcher-17] 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:766)
 - Failed to submit a listener notification task. Event loop shut down? 
java.util.concurrent.RejectedExecutionException: event executor terminated at 

[jira] [Updated] (FLINK-16626) Prevent REST handler from being closed more than once

2020-04-05 Thread Weike Dong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weike Dong updated FLINK-16626:
---
Summary: Prevent REST handler from being closed more than once  (was: 
Exception encountered when cancelling a job in yarn per-job mode)

> Prevent REST handler from being closed more than once
> -
>
> Key: FLINK-16626
> URL: https://issues.apache.org/jira/browse/FLINK-16626
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.10.0
>Reporter: chaiyongqiang
>Assignee: Weike Dong
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: jobmanager.log, patch.diff
>
>
> When executing the following command to stop a Flink job in YARN per-job 
> mode, the client keeps retrying until it times out (after 1 minute) and exits 
> with a failure, even though the job stops successfully.
>  Command:
> {noformat}
> flink cancel $jobId yid appId
> {noformat}
>  The exception on the client side is:
> {quote}
> 2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient - 
> Sending request of class class 
> org.apache.flink.runtime.rest.messages.EmptyRequestBody to 
> ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
> ...
> 2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient - 
> Sending request of class class 
> org.apache.flink.runtime.rest.messages.EmptyRequestBody to 
> ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
> 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - 
> Shutting down rest endpoint.
> 2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient - 
> Sending request of class class 
> org.apache.flink.runtime.rest.messages.EmptyRequestBody to 
> ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
> 2020-03-17 12:33:14,077 DEBUG 
> org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2 
> thread-local buffer(s) from thread: flink-rest-client-netty-thread-1
> 2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest 
> endpoint shutdown complete.
> 2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error 
> while running the command.
> org.apache.flink.util.FlinkException: Could not cancel job 
> cc61033484d4c0e7a27a8a2a36f4de7a.
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
>  at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>  at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
>  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
> Caused by: java.util.concurrent.TimeoutException
>  at 
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
>  ... 9 more
> 
>  The program finished with the following exception:
> org.apache.flink.util.FlinkException: Could not cancel job 
> cc61033484d4c0e7a27a8a2a36f4de7a.
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
>  at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>  at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
>  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
> Caused by: java.util.concurrent.TimeoutException
>  at 
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>  at 
>