[ https://issues.apache.org/jira/browse/FLINK-28497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lihe ma closed FLINK-28497.
---------------------------
    Resolution: Duplicate

> resource leak when job failed with unknown status In Application Mode
> ---------------------------------------------------------------------
>
>                 Key: FLINK-28497
>                 URL: https://issues.apache.org/jira/browse/FLINK-28497
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.1
>            Reporter: lihe ma
>            Priority: Minor
>
> I found a job that had restarted thousands of times, and the jobmanager tried
> to create a new taskmanager pod every time. The jobmanager restarted because
> the job was submitted with a duplicate job ID[1] (we preset the jobId rather
> than generating it), but unfortunately I did not save the logs.
> This job requires one taskmanager pod under normal circumstances, but
> thousands of pods were eventually leaked.
> !image-2022-07-12-11-02-43-009.png|width=666,height=366!
> In application mode, cluster resources are released when the job finishes in
> succeeded, failed or canceled status[2][3]. When an exception occurs, the
> job may be terminated in unknown status[4].
> In this case, the job exited with unknown status without releasing its
> taskmanager pods. So is it reasonable not to release the taskmanager pods when
> the job exits in unknown status?
>
>
> One line from the original logs:
> 2022-07-01 09:45:40,712 [main] INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster
> entrypoint process KubernetesApplicationClusterEntrypoint with exit code 1445.
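
The behaviour the report describes can be sketched as follows. This is a minimal illustration only, not Flink code: the class `UnknownStatusSketch`, the enum `Status`, and the method `releasesResources` are hypothetical names, and the assumption that only the three named statuses trigger cleanup is taken from the report itself, not verified against the Flink source.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the cleanup decision described in the report.
// None of these names exist in Flink; they only mirror the report's wording.
class UnknownStatusSketch {

    // The four terminal statuses the report mentions.
    enum Status { SUCCEEDED, FAILED, CANCELED, UNKNOWN }

    // Assumption from the report: resources are released only for these three.
    private static final Set<Status> RELEASING =
            EnumSet.of(Status.SUCCEEDED, Status.FAILED, Status.CANCELED);

    // Returns true if, under the reported behaviour, taskmanager pods
    // would be released when the job ends in the given status.
    static boolean releasesResources(Status status) {
        return RELEASING.contains(status);
    }

    public static void main(String[] args) {
        for (Status status : Status.values()) {
            System.out.println(status + " -> release resources: "
                    + releasesResources(status));
        }
    }
}
```

Under this reading, every restart that ends in the unknown status skips cleanup, which would be consistent with one leaked pod per restart as the reporter observed.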
>
> [1] [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L452]
> [2] [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L90-L91]
> [3] [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L175]
> [4] [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L39]
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)