I ran into this issue three months ago. We eventually concluded that the Prometheus Pushgateway cannot handle high-throughput metric data, so we solved it with service discovery instead: we changed the Prometheus metric reporter code to add registration logic, so that each job exposes its host and port through a discovery service, and then wrote a plugin for Prometheus that fetches the service list and pulls the metrics directly from the Flink jobs.
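For anyone considering the same pull-based approach: a minimal sketch of turning the registered host/port entries into a target list that Prometheus can consume via its standard `file_sd_configs` mechanism. How the entries are fetched from the discovery service is deployment-specific and not shown; the hostnames, port, and label below are made up for illustration (9249 is only the default port of Flink's pull-based PrometheusReporter).

```python
import json

def render_file_sd(instances, labels=None):
    """Render a Prometheus file_sd_configs target list as JSON.

    `instances` is a list of "host:port" strings for the running
    Flink processes, as obtained from the discovery service.
    """
    group = {"targets": list(instances), "labels": labels or {}}
    return json.dumps([group], indent=2)

# Example: two TaskManagers exposing a metrics endpoint
print(render_file_sd(["flink-tm-1:9249", "flink-tm-2:9249"],
                     {"cluster": "myJob"}))
```

Writing this JSON to a file referenced by `file_sd_configs` in prometheus.yml lets Prometheus pick up target changes without a restart, which avoids the Pushgateway entirely.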
________________________________
From: 李佳宸 <lijiachen...@gmail.com>
Sent: Wednesday, May 13, 2020 11:26:26 AM
To: user@flink.apache.org <user@flink.apache.org>
Subject: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

Hi,

I am stuck using the Prometheus Pushgateway to collect metrics. Here is my reporter configuration:

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: localhost
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: myJob
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true

And the version information:
Flink 1.9.1
Prometheus 2.18
Pushgateway 1.2 & 0.9 (I have already tried both)

I found that when the Flink cluster restarts, metrics with a new jobName (with a random suffix) show up, but the metrics with the jobName from before the restart are still there, with values that no longer update. Since Prometheus keeps periodically scraping the Pushgateway, I end up with a pile of time series whose values never change.
It looks like:

# HELP flink_jobmanager_Status_JVM_CPU_Load Load (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Load gauge
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0.0006602344673593189
# HELP flink_jobmanager_Status_JVM_CPU_Time Time (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Time gauge
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 4.54512e+09
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 8.24809e+09
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded ClassesLoaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 5984
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 6014
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded ClassesUnloaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0

PS: This cluster has one JobManager. My understanding is that with metrics.reporter.promgateway.deleteOnShutdown set to true, the old metrics should be deleted from the Pushgateway, but somehow that did not happen. Is my understanding of this configuration correct?
Is there any way to delete these stale metrics from the Pushgateway? Thanks!
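As a stopgap while the root cause is sorted out, stale groups can be removed by hand through the Pushgateway's HTTP API: a DELETE request against /metrics/job/<job> drops every metric pushed under that grouping key. A minimal sketch (the gateway address and job name are just the ones from this thread; error handling and authentication are omitted):

```python
from urllib import request

def pushgateway_delete_url(gateway, job):
    """Build the Pushgateway management URL for one job grouping key."""
    return f"http://{gateway}/metrics/job/{job}"

def delete_job_metrics(gateway, job):
    """Send an HTTP DELETE to drop every metric pushed under `job`."""
    req = request.Request(pushgateway_delete_url(gateway, job),
                          method="DELETE")
    with request.urlopen(req) as resp:
        return resp.status  # Pushgateway 1.x answers 202 Accepted

# Example: remove the stale pre-restart group seen in the dump above
# delete_job_metrics("localhost:9091", "myJobae71620b106e8c2fdf86cb5c65fd6414")
```

Note that if the reporter pushed with extra grouping labels, those must be appended to the URL as /<label>/<value> pairs for the DELETE to match the group.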