[jira] [Comment Edited] (FLINK-19909) Flink application in attach mode could not terminate when the only job is canceled
[ https://issues.apache.org/jira/browse/FLINK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225613#comment-17225613 ]

Kostas Kloudas edited comment on FLINK-19909 at 11/3/20, 6:52 PM:
--
There is an open PR [~xintongsong]. It would be great if [~fly_in_gis], who reported it, could have a look and verify that this fixes it.

was (Author: kkl0u):
There is an open PR. It would be great if [~fly_in_gis] who reported it could have a look and verify that this fixes it.

> Flink application in attach mode could not terminate when the only job is
> canceled
> --
>
> Key: FLINK-19909
> URL: https://issues.apache.org/jira/browse/FLINK-19909
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Deployment / YARN, Runtime / Coordination
> Affects Versions: 1.12.0, 1.11.3
> Reporter: Yang Wang
> Assignee: Kostas Kloudas
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.12.0, 1.11.3
>
> Attachments: log.jm
>
> Currently, a Yarn or Kubernetes application in attach mode cannot terminate
> the Flink cluster after the only job is canceled, because we throw
> {{ApplicationExecutionException}} in
> {{ApplicationDispatcherBootstrap#runApplicationEntryPoint}} but only check
> for {{ApplicationFailureException}} in
> {{runApplicationAndShutdownClusterAsync}}. We therefore end up in the fatal
> error handler, which makes the jobmanager exit directly, so it has no chance
> to deregister itself from the cluster manager (Yarn/Kubernetes). As a result,
> the cluster manager relaunches the jobmanager again and again until it
> exhausts its retry attempts.
>
> cc [~kkl0u], I am not sure whether this is an expected change. I think it
> worked in 1.11.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
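The mismatch described in the issue (throwing {{ApplicationExecutionException}} but searching the cause chain only for {{ApplicationFailureException}}) can be sketched in plain Java. This is a minimal, self-contained illustration: the two exception classes here are simplified stand-ins, not Flink's real types, and {{findThrowable}} only mimics the spirit of Flink's {{ExceptionUtils.findThrowable}}.

```java
import java.util.Optional;

// Simplified stand-ins for the Flink exception types named in the issue.
class ApplicationExecutionException extends Exception {
    ApplicationExecutionException(String msg, Throwable cause) { super(msg, cause); }
}

class ApplicationFailureException extends Exception {
    ApplicationFailureException(String msg) { super(msg); }
}

public class ExceptionMismatchSketch {
    // Walks the cause chain looking for an exception of the given type,
    // similar in spirit to Flink's ExceptionUtils.findThrowable.
    public static <T extends Throwable> Optional<T> findThrowable(Throwable t, Class<T> type) {
        while (t != null) {
            if (type.isInstance(t)) {
                return Optional.of(type.cast(t));
            }
            t = t.getCause();
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The bootstrap wraps the cancellation in ApplicationExecutionException...
        Throwable thrown = new ApplicationExecutionException("job canceled", null);
        // ...but the shutdown path looks only for ApplicationFailureException,
        // finds nothing, and falls through to the fatal error handler.
        boolean handledGracefully =
                findThrowable(thrown, ApplicationFailureException.class).isPresent();
        System.out.println(handledGracefully ? "graceful shutdown" : "fatal error handler");
    }
}
```

Because the search comes up empty for the wrapped cancellation, the code takes the fatal-error branch instead of the graceful shutdown, which is exactly the behaviour the issue reports.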
[jira] [Comment Edited] (FLINK-19909) Flink application in attach mode could not terminate when the only job is canceled
[ https://issues.apache.org/jira/browse/FLINK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224523#comment-17224523 ]

Kostas Kloudas edited comment on FLINK-19909 at 11/2/20, 9:16 AM:
--
Yes [~trohrmann], I phrased it wrongly before. With "restart" in parentheses I meant transient failures, such as framework failures, from which the application is expected to restart. I am working on a fix and will ping you or [~fly_in_gis] for a review.

was (Author: kkl0u):
Yes [~trohrmann] I phrased it wrongly before. But this is why I put restart in parenthesis. I am working on a fix and ping you or [~fly_in_gis] for a review.
[jira] [Comment Edited] (FLINK-19909) Flink application in attach mode could not terminate when the only job is canceled
[ https://issues.apache.org/jira/browse/FLINK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224241#comment-17224241 ]

Kostas Kloudas edited comment on FLINK-19909 at 11/2/20, 8:03 AM:
--
Thanks for opening this [~fly_in_gis]! I changed this after a comment during review of my PR (https://github.com/apache/flink/pull/13699). Before, in this case the error handler I was using completed the shutdown future of the {{Dispatcher}} exceptionally (see https://github.com/apache/flink/pull/13699#discussion_r508494946).

BTW, if the job gets cancelled, shouldn't we go through [here|https://github.com/kl0u/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L280], which is expected to set the correct exception?

was (Author: kkl0u):
Thanks for opening this [~fly_in_gis]! I changed this after a comment on my PR (https://github.com/apache/flink/pull/13699) after a comment during review. Before, in this case the error handler I was using was completing the shutdown future of the {{Dispatcher}} exceptionally (see https://github.com/apache/flink/pull/13699#discussion_r508494946). I think this would solve the problem. Do you agree [~fly_in_gis]? BTW if the job gets cancelled, shouldn't we go throw [here|https://github.com/kl0u/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L280], which is expected to put the correct exception?
[jira] [Comment Edited] (FLINK-19909) Flink application in attach mode could not terminate when the only job is canceled
[ https://issues.apache.org/jira/browse/FLINK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224253#comment-17224253 ]

Kostas Kloudas edited comment on FLINK-19909 at 11/1/20, 12:58 PM:
---
[~fly_in_gis] We want different behaviour when we cancel a job (delete HA data) and when a job fails (restart). So even if we throw {{ApplicationFailureException}} in {{runApplicationEntryPoint}}, in the case of cancellation it has to contain the {{Status}} of the job as {{Cancelled}}. In your case, do you see a {{JobCancellationException}} being thrown in {{runApplicationEntryPoint()}}?

was (Author: kkl0u):
[~fly_in_gis] We want different behaviour when we cancel a job (delete HA data) and when a job fails (restart). So even if we throw {{ApplicationFailureException}} in the {{runApplicationEntryPoint}}, in the case of cancellation it has to contain the `Status` of the job as `Cancelled` when the job gets cancelled. In your case, you see a`JobCancellationException` being thrown in the `runApplicationEntryPoint()` ?
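The distinction drawn in the comment above (cancellation should clean up HA data and terminate for good, while a failure should allow a restart) can be sketched as a simple policy function. All names here are hypothetical and illustrative, not Flink's actual API; the point is only that the terminal {{Status}} of the job must drive two different cluster-level outcomes.

```java
// Hypothetical sketch of the cancel-vs-fail distinction discussed above.
// Enum and method names are illustrative stand-ins, not Flink internals.
public class ShutdownPolicySketch {
    public enum JobStatus { FINISHED, CANCELED, FAILED }

    public enum ClusterAction { TERMINATE_AND_DELETE_HA_DATA, RESTART }

    public static ClusterAction onJobTerminalState(JobStatus status) {
        switch (status) {
            case CANCELED:
                // Explicit user cancellation: the application is done for good,
                // so HA data can be deleted before the cluster shuts down.
                return ClusterAction.TERMINATE_AND_DELETE_HA_DATA;
            case FAILED:
                // A (framework) failure: keep HA data and restart so the
                // application can recover its state.
                return ClusterAction.RESTART;
            default:
                // Normal completion also terminates and cleans up.
                return ClusterAction.TERMINATE_AND_DELETE_HA_DATA;
        }
    }

    public static void main(String[] args) {
        System.out.println("CANCELED -> " + onJobTerminalState(JobStatus.CANCELED));
        System.out.println("FAILED   -> " + onJobTerminalState(JobStatus.FAILED));
    }
}
```

This is why, in the comment above, a thrown {{ApplicationFailureException}} must still carry the job's {{Status}}: without it, the shutdown path cannot choose between the two branches.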
[jira] [Comment Edited] (FLINK-19909) Flink application in attach mode could not terminate when the only job is canceled
[ https://issues.apache.org/jira/browse/FLINK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224244#comment-17224244 ]

Yang Wang edited comment on FLINK-19909 at 11/1/20, 10:18 AM:
--
[~kkl0u] Thanks for your response. Hmm. Actually, I found this issue while testing K8s native HA. I have two questions here.
* Why are we throwing {{ApplicationExecutionException}} in {{ApplicationDispatcherBootstrap#runApplicationEntryPoint}} instead of {{ApplicationFailureException}}? If we submit the Flink application with {{execution.attached: true}} and then cancel the job, we still go into the fatal error handler, because we only look for {{ApplicationFailureException}}.
* Using {{ClusterEntrypoint#onFatalError}} is not reasonable; it leaves us no chance to deregister the Flink application.

was (Author: fly_in_gis):
[~kkl0u] Thanks for your response. Hmm. Actually, I find this issue when I am test K8s native HA. I have two questions here. * Why we are throwing {{ApplicationExecutionException}} in {{ApplicationDispatcherBootstrap#runApplicationEntryPoint}}, not {{ApplicationFailureException}}? If we submit the Flink application with {{execution.attached: true}}, and we cancel the job, then we still go into the fatal error handler. Because we only try to find the {{ApplicationFailureException}}. * Using the {{ClusterEntrypoint#onFatalError}} is not reasonable, it will make us have no chance to deregister the Flink application.
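The second point above, that {{ClusterEntrypoint#onFatalError}} kills the process before the entrypoint can deregister from Yarn/Kubernetes, suggests the alternative discussed earlier in the thread: complete the cluster's shutdown future instead, so the normal termination sequence (including deregistration) still runs. The following is a minimal, self-contained sketch of that idea using plain {{CompletableFuture}}; the class, field, and status names are simplified stand-ins for the Flink internals mentioned above, not the real API.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the graceful shutdown path: rather than routing an application
// error to a fatal-error handler (which exits the JVM immediately), complete
// a shutdown future that the entrypoint observes. The entrypoint can then
// deregister the application from the cluster manager before exiting.
public class GracefulShutdownSketch {
    // Stand-in for the Dispatcher's shutdown future; the String value plays
    // the role of an ApplicationStatus here.
    public static final CompletableFuture<String> shutDownFuture = new CompletableFuture<>();

    public static void onApplicationError(Throwable t) {
        // Graceful: signal termination instead of calling a fatal error
        // handler, so deregistration from Yarn/Kubernetes can still happen.
        shutDownFuture.complete("CANCELED");
    }

    public static void main(String[] args) {
        onApplicationError(new RuntimeException("the only job was canceled"));
        // The entrypoint would wait on this future, deregister the
        // application, and only then shut the process down.
        System.out.println("shutdown status: " + shutDownFuture.join());
    }
}
```

Compared with the fatal-error path described in the issue, this keeps the jobmanager alive just long enough to deregister, so the cluster manager does not relaunch it.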