[ https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oscar Westra van Holthe - Kind updated FLINK-11457:
---------------------------------------------------
    Description:
When cancelling a job running on a YARN-based cluster and then shutting down the cluster, metrics on the push gateway are not deleted.

Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the best solution would be.

  was:
When using the PrometheusPushGatewayReporter, one has two options:
* Use a fixed job name, which causes the jobmanager and taskmanager to overwrite each other's metrics (i.e. last write wins, and you lose a lot of metrics)
* Use a random suffix for the job name, which creates a lot of labels that have to be cleaned up manually

The manual cleanup should not be necessary, but happens nonetheless when using a YARN cluster.

A fix could be to add a suffix to the job name, naming the nodes in a non-random manner like: {{myjob_jm0}}, {{my_job_tm1}}, {{my_job_tm1}}, {{my_job_tm2}}, {{my_job_tm3}}, {{my_job_tm4}}, ..., using a counter (not sure if such is available), or some other stable (!) suffix.

Related discussion: FLINK-9187

Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the best solution would be.

> PrometheusPushGatewayReporter does not cleanup its metrics
> ----------------------------------------------------------
>
>                 Key: FLINK-11457
>                 URL: https://issues.apache.org/jira/browse/FLINK-11457
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Oscar Westra van Holthe - Kind
>            Priority: Major
>
> When cancelling a job running on a YARN-based cluster and then shutting down
> the cluster, metrics on the push gateway are not deleted.
>
> Any thoughts on a solution? I'm happy to implement it, but I'm not sure what
> the best solution would be.
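
To make the two ideas in this issue concrete (a stable, non-random job name suffix, and deleting the pushed metrics on shutdown), here is a minimal sketch. It is not Flink's implementation: the class `PushGatewayNames`, both helper methods, and the gateway address `http://localhost:9091` are hypothetical names chosen for illustration. The only external fact it relies on is that the Prometheus Pushgateway removes a job's metric group on an HTTP `DELETE` to `/metrics/job/<job name>`. Where the per-node index would come from is exactly the open question raised above, so it is simply passed in here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class PushGatewayNames {

    /** Hypothetical helper: builds a stable job name such as "myjob_jm0" or "myjob_tm3". */
    static String stableJobName(String baseName, String role, int index) {
        if (index < 0) {
            throw new IllegalArgumentException("index must be non-negative");
        }
        return baseName + "_" + role + index;
    }

    /**
     * Hypothetical helper: builds (but does not send) the DELETE request that
     * asks the Pushgateway to drop all metrics pushed under the given job name.
     * Assumption: the job name only contains characters that are safe in a URL path.
     */
    static HttpRequest deleteRequest(String gatewayBaseUrl, String jobName) {
        if (!jobName.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("unsupported job name: " + jobName);
        }
        return HttpRequest.newBuilder(URI.create(gatewayBaseUrl + "/metrics/job/" + jobName))
                .DELETE()
                .build();
    }

    public static void main(String[] args) {
        // The same inputs always yield the same name, so a restart overwrites
        // the old metric group instead of leaving a new one behind.
        String job = stableJobName("myjob", "tm", 1);
        HttpRequest request = deleteRequest("http://localhost:9091", job);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

With a stable name, a reporter would only need to issue this one request when it closes. With a random suffix, it must at least remember the suffix it generated in order to clean up after itself.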