[ https://issues.apache.org/jira/browse/AIRFLOW-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16747145#comment-16747145 ]
ASF subversion and git services commented on AIRFLOW-3177:
----------------------------------------------------------

Commit bccd0ab344a999dc19ca5e2fe080a017677afe60 in airflow's branch refs/heads/v1-10-test from Greg Neiheisel
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=bccd0ab ]

[AIRFLOW-3177] Change scheduler_heartbeat from gauge to counter (#4027)

This changes the scheduler_heartbeat metric from a gauge to a counter to
better support using the statsd_exporter with Prometheus. A counter lets
users track the rate of the heartbeat and integrates better with the
exporter. With a gauge, a crashed or stopped scheduler stops emitting the
metric, but the statsd_exporter keeps reporting the last value, 1. A counter
fixes this because its value keeps increasing while the scheduler is running,
so a lack of change indicates a problem with the scheduler.

Add statsd change notice in UPDATING.md

> Change scheduler_heartbeat metric from gauge to counter
> -------------------------------------------------------
>
>                 Key: AIRFLOW-3177
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3177
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>    Affects Versions: 2.0.0
>            Reporter: Greg Neiheisel
>            Assignee: Greg Neiheisel
>            Priority: Minor
>             Fix For: 1.10.1
>
>
> Currently, the scheduler_heartbeat metric exposed with the statsd integration
> is a gauge. I'm proposing to change the gauge to a counter for better
> integration with Prometheus via the
> [statsd_exporter|https://github.com/prometheus/statsd_exporter].
> Rather than pointing Airflow at an actual statsd server, you can point it at
> this exporter, which accumulates the metrics and exposes them at /metrics to
> be scraped by Prometheus. The problem is that once the scheduler sets this
> value on its first loop, it is always exposed to Prometheus as 1. The
> scheduler can crash or be turned off, and the statsd_exporter will still
> report 1 until the exporter itself is restarted and rebuilds its internal
> state.
>
> By turning this metric into a counter, we can detect problems with the
> scheduler by graphing and alerting on its rate. If the counter's rate of
> change drops below the expected rate (determined by the
> scheduler_heartbeat_secs setting), we can fire an alert.
>
> This should help adoption in Kubernetes environments, where Prometheus is
> effectively the standard.
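The following is a minimal Python sketch of the two ideas above. It is not Airflow's actual code. It shows the statsd wire format that separates a gauge (`|g`) from a counter (`|c`), and a rate check over counter samples scraped from the exporter. The function names, the `(timestamp, value)` sample shape, and the `tolerance` factor are hypothetical. The sketch also assumes the scheduler increments the counter roughly once per `scheduler_heartbeat_secs`.

```python
from typing import List, Tuple

# Hypothetical sample shape: (unix timestamp in seconds, cumulative counter value),
# e.g. as scraped from the statsd_exporter's /metrics endpoint.
Sample = Tuple[float, float]


def statsd_line(name: str, value: int, kind: str) -> str:
    """Format a plain statsd datagram; kind is 'g' (gauge) or 'c' (counter)."""
    return f"{name}:{value}|{kind}"


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter across the sampled window.

    A drop in value is treated as a counter reset (e.g. the exporter restarted),
    so the post-reset value counts as the increase, similar in spirit to
    Prometheus' rate().
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def scheduler_looks_down(samples: List[Sample], heartbeat_secs: float,
                         tolerance: float = 0.5) -> bool:
    """Flag the scheduler when the heartbeat rate falls below a fraction of the
    expected rate, assumed here to be one increment per heartbeat_secs."""
    expected = 1.0 / heartbeat_secs
    return counter_rate(samples) < expected * tolerance


if __name__ == "__main__":
    # A gauge keeps reporting 1 whether or not the scheduler is alive;
    # a counter only keeps increasing while heartbeats arrive.
    print(statsd_line("scheduler_heartbeat", 1, "g"))  # scheduler_heartbeat:1|g
    print(statsd_line("scheduler_heartbeat", 1, "c"))  # scheduler_heartbeat:1|c

    healthy = [(0, 0), (5, 1), (10, 2), (15, 3), (20, 4)]
    stalled = [(0, 4), (5, 4), (10, 4), (15, 4), (20, 4)]
    print(scheduler_looks_down(healthy, heartbeat_secs=5))  # False
    print(scheduler_looks_down(stalled, heartbeat_secs=5))  # True
```

In a real Prometheus setup the same check would be written with PromQL's `rate()` over the exporter's counter series in an alerting rule. The threshold of half the expected rate is an arbitrary example value.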