[jira] [Comment Edited] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s

Ufuk Celebi (Jira) Fri, 28 Aug 2020 06:44:39 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186532#comment-17186532
 ]


Ufuk Celebi edited comment on FLINK-18828 at 8/28/20, 1:19 PM:
---------------------------------------------------------------

[~fly_in_gis] Thanks for the pointers. The Flink job would only transition to 
FAILED once the Flink-level restart strategy has been exhausted. In your 
example of fixed-delay with 3 attempts, the first three failures would _not_ 
cause the container to exit; only on the 4th failure would the job 
transition to FAILED and the container exit.
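
As a rough illustration of that counting (a sketch only, not Flink's actual restart strategy code; the function name and the status strings are hypothetical), a fixed-delay strategy with 3 attempts absorbs three failures and gives up on the fourth:

```python
def run_with_fixed_delay(max_attempts, failures):
    """Simulate a job that fails `failures` times under a fixed-delay
    restart strategy allowing `max_attempts` restarts.

    Returns (restarts_performed, final_status). Delays are ignored here,
    since only the counting matters for this example.
    """
    restarts = 0
    for _ in range(failures):
        if restarts >= max_attempts:
            # Restart strategy exhausted: the job goes to FAILED and, in
            # the scenario discussed here, the container would exit.
            return restarts, "FAILED"
        # Flink restarts the job internally; the container keeps running.
        restarts += 1
    return restarts, "RUNNING"


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, run_with_fixed_delay(3, n))
```

Under these assumptions, only the fourth failure would ever be visible to K8s as a container exit.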

I think the bigger problem with my proposal to set the restart policy to Never 
is that the container would not be restarted in other failure scenarios either 
(e.g. when it is OOM killed). So I don't think it's a viable option.
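
To make the trade-off concrete, here is a small model of how the three K8s restart policies react to a container's exit code. It is a simplified sketch, not K8s code: the function name is hypothetical, and the only concrete value assumed is the usual exit code 137 (128 + SIGKILL) for an OOM-killed container.

```python
OOM_KILLED_EXIT_CODE = 137  # 128 + 9 (SIGKILL), as typically reported for OOM kills


def would_restart(restart_policy, exit_code):
    """Return True if a container with this exit code would be restarted
    under the given restart policy (simplified model)."""
    if restart_policy == "Always":
        return True
    if restart_policy == "OnFailure":
        return exit_code != 0
    if restart_policy == "Never":
        return False
    raise ValueError(f"unknown restart policy: {restart_policy}")


if __name__ == "__main__":
    for policy in ("Always", "OnFailure", "Never"):
        print(policy,
              "clean exit:", would_restart(policy, 0),
              "OOM killed:", would_restart(policy, OOM_KILLED_EXIT_CODE))
```

With Never, the OOM-killed container stays down, which is the scenario I'm worried about. With OnFailure, it comes back, but so does a jobmanager that exits non-zero because the job reached FAILED.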

I don't see a good way around this problem without your proposed 
change.

---

Maybe as a follow-up we want to resurrect 
https://issues.apache.org/jira/browse/FLINK-10948 ? That way, users would at 
least be able to determine the final Flink job status.


> Terminate jobmanager process with zero exit code to avoid unexpected 
> restarting by K8s
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-18828
>                 URL: https://issues.apache.org/jira/browse/FLINK-18828
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.1, 1.12.0, 1.11.1
>            Reporter: Yang Wang
>            Priority: Major
>             Fix For: 1.12.0, 1.11.2, 1.10.3
>
>
> Currently, the Flink jobmanager process terminates with a non-zero exit code if 
> the job reaches {{ApplicationStatus.FAILED}}. This is not ideal in a K8s 
> deployment, since a non-zero exit code causes an unexpected restart. Also, 
> from a framework's perspective, a FAILED job does not mean that Flink has 
> failed and, hence, the return code could still be 0.
> Note:
> This is a special case for standalone K8s deployment. For 
> standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is 
> harmless. And a non-zero exit code could help to check the job result quickly.
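
The behaviour described in the issue can be sketched as a single decision. This is an illustration only, not Flink's implementation: the function, the `zero_on_failed` switch, and the placeholder non-zero code 1 are all assumptions, and only the SUCCEEDED and FAILED statuses are modelled.

```python
def jobmanager_exit_code(final_status, zero_on_failed):
    """Pick the process exit code for a finished job.

    final_status   -- "SUCCEEDED" or "FAILED" (other statuses not modelled)
    zero_on_failed -- hypothetical switch standing in for the proposed
                      change for standalone K8s deployments
    """
    if final_status == "FAILED" and not zero_on_failed:
        return 1  # placeholder for "some non-zero exit code"
    return 0


if __name__ == "__main__":
    # Current behaviour: a FAILED job yields a non-zero exit code.
    print(jobmanager_exit_code("FAILED", zero_on_failed=False))
    # Proposed behaviour for standalone K8s: exit 0 so the pod is not restarted.
    print(jobmanager_exit_code("FAILED", zero_on_failed=True))
```

The cost of returning 0 is the one the note above points out: the exit code alone no longer tells you whether the job succeeded.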



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
