[ https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280055#comment-17280055 ]
jiguodai commented on FLINK-21309:
----------------------------------

My solution is as follows:

{code:java}
public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter implements Scheduled {

    @Override
    public void report() {
        try {
            // Changed from push (PUT) to pushAdd (POST), so that only metrics
            // with the same name are replaced within the group.
            pushGateway.pushAdd(CollectorRegistry.defaultRegistry, jobName, groupingKey);
        } catch (Exception e) {
            log.warn("Failed to push metrics to PushGateway with jobName {}, groupingKey {}.", jobName, groupingKey, e);
        }
    }
}
{code}

> Metrics of JobManager and TaskManager overwrite each other in pushgateway
> -------------------------------------------------------------------------
>
>                 Key: FLINK-21309
>                 URL: https://issues.apache.org/jira/browse/FLINK-21309
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics
>    Affects Versions: 1.9.0, 1.10.0, 1.11.0
>         Environment: 1. Components:
> Flink 1.9.0/1.10.0/1.11.0 + Prometheus + Pushgateway + Yarn
> 2. Metrics configuration in flink-conf.yaml:
> {code:java}
> metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.jobName: myjob
> metrics.reporter.promgateway.randomJobNameSuffix: false{code}
>
>            Reporter: jiguodai
>            Priority: Major
>         Attachments: image-2021-02-05-21-07-42-292.png
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> When a Flink job runs on Yarn, the metrics of the jobmanager and the
> taskmanagers overwrite each other. In one second the pushgateway web UI
> shows only jobmanager metrics, and in the next second it shows only
> taskmanager metrics; the two kinds of metrics appear alternately. As a
> result, a taskmanager metric on Grafana intermittently looks like the graph
> below (the taskmanager metric disappears from Grafana whenever jobmanager
> metrics overwrite the taskmanager metrics):
> !image-2021-02-05-21-07-42-292.png!
> The root cause is that Flink's PrometheusPushGatewayReporter uses PUT
> instead of POST to push metrics to the pushgateway. In addition, the
> taskmanagers and the jobmanager use the same jobName (the only grouping
> key), which we configured in flink-conf.yaml.
> Although the REST URL is the same for both methods,
> {code:java}
> /metrics/job/<JOB_NAME>{/<LABEL_NAME>/<LABEL_VALUE>}
> {code}
> PUT and POST produce different results:
> * PUT is used to push a group of metrics. All metrics with the grouping key
> specified in the URL are replaced by the metrics pushed with PUT.
> * POST works exactly like the PUT method, but only metrics with the same
> name as the newly pushed metrics are replaced.
> For these reasons, it is better to use POST to push metrics to the
> pushgateway, so that jobmanager metrics and taskmanager metrics no longer
> overwrite each other and we get a continuous graph on Grafana. One could
> instead set
> {code:java}
> metrics.reporter.promgateway.randomJobNameSuffix: true{code}
> in flink-conf.yaml, so that the jobName from each node gets a random suffix
> and the metrics no longer overwrite each other. However, most users filter
> on jobName in PromQL, and using regular expressions to match the suffixed
> job names slows down data retrieval in Prometheus.
> Every time somebody asks on the Flink mailing list why metrics on Grafana
> are discontinuous, I tell them to change the push method from PUT to POST
> and then repackage the flink-metrics-prometheus module. So why don't we
> solve the problem permanently now? I sincerely hope to have the chance to
> fix it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
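
The PUT/POST difference described in the issue can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class `PushgatewayGroupSketch`, its methods, the `job=myjob` key string, and the metric names are all hypothetical, and nothing here talks to a real Pushgateway or uses the Prometheus client. It models one group per grouping key and shows why two processes that share a jobName erase each other's metrics with PUT but not with POST.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of Pushgateway groups; hypothetical, not the real server or client. */
final class PushgatewayGroupSketch {
    // grouping key (e.g. "job=myjob") -> metric name -> last pushed value
    private final Map<String, Map<String, Double>> groups = new HashMap<>();

    /** PUT semantics: the pushed metrics replace the whole group. */
    void put(String groupingKey, Map<String, Double> metrics) {
        groups.put(groupingKey, new TreeMap<>(metrics));
    }

    /** POST semantics: only metrics with the same name are replaced. */
    void post(String groupingKey, Map<String, Double> metrics) {
        groups.computeIfAbsent(groupingKey, k -> new TreeMap<>()).putAll(metrics);
    }

    /** Returns the metrics currently stored for the given grouping key. */
    Map<String, Double> group(String groupingKey) {
        return groups.getOrDefault(groupingKey, new TreeMap<>());
    }

    public static void main(String[] args) {
        // Made-up metric names standing in for the two kinds of reporters.
        Map<String, Double> jobManager = Map.of("jobmanager_metric", 1.0);
        Map<String, Double> taskManager = Map.of("taskmanager_metric", 2.0);

        // Both processes push with the same jobName, hence the same group.
        PushgatewayGroupSketch viaPut = new PushgatewayGroupSketch();
        viaPut.put("job=myjob", jobManager);
        viaPut.put("job=myjob", taskManager); // wipes the jobmanager metric
        System.out.println("PUT:  " + viaPut.group("job=myjob").keySet());

        PushgatewayGroupSketch viaPost = new PushgatewayGroupSketch();
        viaPost.post("job=myjob", jobManager);
        viaPost.post("job=myjob", taskManager); // both metrics survive
        System.out.println("POST: " + viaPost.group("job=myjob").keySet());
    }
}
```

In this model the group ends up holding only `taskmanager_metric` after the two PUT calls, and both metrics after the two POST calls, which matches the alternating display described in the issue and the effect of switching the reporter from `push` to `pushAdd`.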