[ https://issues.apache.org/jira/browse/FLINK-13787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910508#comment-16910508 ]
Kaibo Zhou edited comment on FLINK-13787 at 8/20/19 9:44 AM:
-------------------------------------------------------------

I did some investigation. The reason is that Flink's JobManager pod is controlled by a Kubernetes *Job*, while the TaskManager pods are controlled by a *Deployment*. When we run a standalone job on Kubernetes and then cancel it, the JobManager shuts down and its pod moves to the *Completed* state, so the JobManager calls the close method of PrometheusPushGatewayReporter, which deletes its metrics from the pushgateway. However, the TaskManager pods keep running, the shutdown logic inside the TaskManager is never invoked, and the TaskManager's metrics on the pushgateway are never deleted. In the current implementation the TaskManager pods cannot simply be killed or stopped either, because the Kubernetes Deployment would restart them to maintain the desired number of replicas.

I think the expected behavior is that when the user cancels the job, the TaskManager should run its shutdown logic, which calls MetricRegistryImpl.shutdown() to release resources.
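One way to trigger that cleanup on Kubernetes is a JVM shutdown hook: when the user deletes the TaskManager Deployment, Kubernetes sends SIGTERM to the pods, the JVM exits in an orderly way, and registered hooks run. The following is a minimal, self-contained sketch of that idea, not Flink's actual reporter code; `PushGatewayClient` and `deleteMetrics` are hypothetical stand-ins for the reporter's internals:

```java
// Sketch (assumptions labeled): illustrates how a Deployment-managed
// TaskManager could still delete its pushgateway metrics on pod
// termination via a JVM shutdown hook. PushGatewayClient is a
// hypothetical stand-in, not a Flink or Prometheus-client class.
public class ShutdownCleanupSketch {

    /** Hypothetical stand-in for the reporter's pushgateway client. */
    static class PushGatewayClient {
        boolean deleted = false;

        // In the real reporter this would issue an HTTP DELETE to the
        // pushgateway for this job's metric group.
        void deleteMetrics(String jobName) {
            deleted = true;
            System.out.println("deleted metrics for " + jobName);
        }
    }

    /** Registers a hook that runs on orderly JVM exit (e.g. SIGTERM). */
    static Thread registerCleanupHook(PushGatewayClient client, String jobName) {
        Thread hook = new Thread(() -> client.deleteMetrics(jobName));
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        PushGatewayClient client = new PushGatewayClient();
        registerCleanupHook(client, "WordCount");
        // ... TaskManager work ...
        // When Kubernetes terminates the pod with SIGTERM, the JVM exits
        // and the hook fires, removing the metrics from the pushgateway.
    }
}
```

Note that shutdown hooks only run on an orderly exit: Kubernetes sends SIGTERM first (which triggers hooks) and SIGKILL (which does not) only after the termination grace period, so the cleanup has to finish within that window.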
> PrometheusPushGatewayReporter does not cleanup TM metrics when run on
> kubernetes
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-13787
>                 URL: https://issues.apache.org/jira/browse/FLINK-13787
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics
>    Affects Versions: 1.7.2, 1.8.1, 1.9.0
>            Reporter: Kaibo Zhou
>            Priority: Major
>
> I ran a Flink job on Kubernetes with the PrometheusPushGatewayReporter, and
> I could see the metrics from the Flink JobManager and TaskManager in the
> pushgateway's UI.
> When I cancel the job, the JobManager's metrics disappear, but the
> TaskManager's metrics still exist, even though I have set
> _deleteOnShutdown_ to true.
> The configuration is:
> {code:java}
> metrics.reporters: "prom"
> metrics.reporter.prom.class: "org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter"
> metrics.reporter.prom.jobName: "WordCount"
> metrics.reporter.prom.host: "localhost"
> metrics.reporter.prom.port: "9091"
> metrics.reporter.prom.randomJobNameSuffix: "true"
> metrics.reporter.prom.filterLabelValueCharacters: "true"
> metrics.reporter.prom.deleteOnShutdown: "true"
> {code}
>
> Other people have also encountered this problem:
> [https://stackoverflow.com/questions/54420498/flink-prometheus-push-gateway-reporter-delete-metrics-on-job-shutdown].
> And another similar issue: FLINK-11457.
>
> As Prometheus is a very important metrics system on Kubernetes, solving this
> problem would be beneficial for users monitoring their Flink jobs.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)