[ https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186532#comment-17186532 ]
Ufuk Celebi edited comment on FLINK-18828 at 8/28/20, 1:19 PM:
---------------------------------------------------------------

[~fly_in_gis] Thanks for the pointers.

The Flink job only transitions to FAILED once the Flink-level restart strategy has been exhausted. In your example of fixed-delay with 3 attempts, the first three restarts would _not_ cause the container to exit; only on the 4th failure would the job transition to FAILED and the container exit.

I think the bigger problem with my proposal to set the K8s restart policy to Never is that the container would not be restarted in other failure scenarios (e.g. OOM killed). So overall, I don't think it's a viable option, and I don't see a good way around this problem without your proposed change.

---

Maybe as a follow-up we want to resurrect https://issues.apache.org/jira/browse/FLINK-10948? That way, users would at least be able to determine the final Flink job status.
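
To make the counting in the fixed-delay example concrete, here is a minimal sketch in Java. It is not Flink's actual implementation: the class `FixedDelayRestartSketch`, its `JobState` enum, and the `onFailure` method are hypothetical names used only for illustration, and the delay between restarts is left out. It assumes that a strategy with N attempts absorbs N failures by restarting inside the running process, and that failure N+1 moves the job to FAILED.

```java
// Hypothetical illustration only; these are not Flink classes.
final class FixedDelayRestartSketch {

    enum JobState { RUNNING, FAILED }

    private final int maxRestartAttempts;
    private int restartsUsed = 0;
    private JobState state = JobState.RUNNING;

    FixedDelayRestartSketch(int maxRestartAttempts) {
        this.maxRestartAttempts = maxRestartAttempts;
    }

    /** Records one failure and returns the job state afterwards. */
    JobState onFailure() {
        if (state == JobState.FAILED) {
            return state; // already terminal, nothing left to restart
        }
        if (restartsUsed < maxRestartAttempts) {
            // Restart handled at the Flink level: the process (and the
            // container) keeps running, so K8s sees no exit.
            restartsUsed++;
        } else {
            // Restart strategy exhausted: the job becomes FAILED and
            // only now would the process terminate.
            state = JobState.FAILED;
        }
        return state;
    }

    public static void main(String[] args) {
        FixedDelayRestartSketch job = new FixedDelayRestartSketch(3);
        for (int failure = 1; failure <= 4; failure++) {
            System.out.println("failure " + failure + " -> " + job.onFailure());
        }
    }
}
```

With 3 attempts, the first three calls leave the job in RUNNING and the fourth returns FAILED, which is the point at which the container exit code starts to matter for K8s.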
> Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-18828
>                 URL: https://issues.apache.org/jira/browse/FLINK-18828
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.1, 1.12.0, 1.11.1
>            Reporter: Yang Wang
>            Priority: Major
>             Fix For: 1.12.0, 1.11.2, 1.10.3
>
>
> Currently, Flink jobmanager process terminates with a non-zero exit code if
> the job reaches the {{ApplicationStatus.FAILED}}. It is not ideal in K8s
> deployment, since non-zero exit code will cause unexpected restarting. Also
> from a framework's perspective, a FAILED job does not mean that Flink has
> failed and, hence, the return code could still be 0.
>
> Note:
> This is a special case for standalone K8s deployment. For
> standalone/Yarn/Mesos/native K8s, terminating with non-zero exit code is
> harmless. And a non-zero exit code could help to check the job result quickly.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)