[ https://issues.apache.org/jira/browse/FLINK-13787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910508#comment-16910508 ]
Kaibo Zhou edited comment on FLINK-13787 at 8/20/19 9:44 AM:
-------------------------------------------------------------

I did some investigation. The reason is that Flink's JobManager pod is controlled by a Kubernetes *Job*, while the TaskManager pods are controlled by a *Deployment*. When we run a standalone job on Kubernetes and then cancel it, the JobManager shuts down and its pod moves to the *Completed* state, so the JobManager calls the close method of PrometheusPushGatewayReporter, which deletes its metrics from the pushgateway. However, the TaskManager pods keep running, the shutdown logic inside the TaskManager is never invoked, and the TaskManager's metrics on the pushgateway are never deleted. In the current implementation the TaskManager pods cannot simply be killed or stopped either, because the Kubernetes Deployment would restart them to maintain the desired number of replicas.

I think the expected behavior is that when the user cancels the job, the TaskManager should run its shutdown logic, which calls MetricRegistryImpl.shutdown() to release resources.
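One way to trigger that cleanup on Kubernetes is a JVM shutdown hook: when the user deletes the TaskManager Deployment, Kubernetes sends SIGTERM to the pods, the JVM exits in an orderly way, and registered hooks run. The following is a minimal, self-contained sketch of that idea, not Flink's actual reporter code; `PushGatewayClient` and `deleteMetrics` are hypothetical stand-ins for the reporter's internals:

```java
// Sketch (assumptions labeled): illustrates how a Deployment-managed
// TaskManager could still delete its pushgateway metrics on pod
// termination via a JVM shutdown hook. PushGatewayClient is a
// hypothetical stand-in, not a Flink or Prometheus-client class.
public class ShutdownCleanupSketch {

    /** Hypothetical stand-in for the reporter's pushgateway client. */
    static class PushGatewayClient {
        boolean deleted = false;

        // In the real reporter this would issue an HTTP DELETE to the
        // pushgateway for this job's metric group.
        void deleteMetrics(String jobName) {
            deleted = true;
            System.out.println("deleted metrics for " + jobName);
        }
    }

    /** Registers a hook that runs on orderly JVM exit (e.g. SIGTERM). */
    static Thread registerCleanupHook(PushGatewayClient client, String jobName) {
        Thread hook = new Thread(() -> client.deleteMetrics(jobName));
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        PushGatewayClient client = new PushGatewayClient();
        registerCleanupHook(client, "WordCount");
        // ... TaskManager work ...
        // When Kubernetes terminates the pod with SIGTERM, the JVM exits
        // and the hook fires, removing the metrics from the pushgateway.
    }
}
```

Note that shutdown hooks only run on an orderly exit: Kubernetes sends SIGTERM first (which triggers hooks) and SIGKILL (which does not) only after the termination grace period, so the cleanup has to finish within that window.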
> PrometheusPushGatewayReporter does not cleanup TM metrics when run on
> kubernetes
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-13787
>                 URL: https://issues.apache.org/jira/browse/FLINK-13787
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics
>    Affects Versions: 1.7.2, 1.8.1, 1.9.0
>            Reporter: Kaibo Zhou
>            Priority: Major
>
> I ran a Flink job on Kubernetes with the PrometheusPushGatewayReporter, and
> I could see the metrics from the Flink JobManager and TaskManager in the
> pushgateway's UI.
> When I cancel the job, the JobManager's metrics disappear, but the
> TaskManager's metrics still exist, even though I have set
> _deleteOnShutdown_ to true.
> The configuration is:
> {code:java}
> metrics.reporters: "prom"
> metrics.reporter.prom.class: "org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter"
> metrics.reporter.prom.jobName: "WordCount"
> metrics.reporter.prom.host: "localhost"
> metrics.reporter.prom.port: "9091"
> metrics.reporter.prom.randomJobNameSuffix: "true"
> metrics.reporter.prom.filterLabelValueCharacters: "true"
> metrics.reporter.prom.deleteOnShutdown: "true"
> {code}
>
> Other people have also encountered this problem:
> [https://stackoverflow.com/questions/54420498/flink-prometheus-push-gateway-reporter-delete-metrics-on-job-shutdown].
> And another similar issue: FLINK-11457.
>
> As Prometheus is a very important metrics system on Kubernetes, solving this
> problem would be beneficial for users monitoring their Flink jobs.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)