[ 
https://issues.apache.org/jira/browse/AIRFLOW-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16747145#comment-16747145
 ] 

ASF subversion and git services commented on AIRFLOW-3177:
----------------------------------------------------------

Commit bccd0ab344a999dc19ca5e2fe080a017677afe60 in airflow's branch 
refs/heads/v1-10-test from Greg Neiheisel
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=bccd0ab ]

[AIRFLOW-3177] Change scheduler_heartbeat from gauge to counter (#4027)

This updates the scheduler_heartbeat metric from a gauge to a counter to
better support the statsd_exporter for usage with Prometheus. A counter
allows users to track the rate of the heartbeat, and integrates with the
exporter better. With a gauge, a crashed or stopped scheduler no longer
emits the metric, but the statsd_exporter keeps showing 1 as the metric
value. A counter fixes that issue because it changes continually while
the scheduler runs, so a lack of change indicates a problem with the
scheduler.

Add statsd change notice in UPDATING.md
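
The difference between the two metric types can be illustrated with a
toy model. This is only a sketch: ToyExporter, scrape and scheduler_loops
are hypothetical names for illustration and do not reflect the actual
statsd_exporter or Airflow code. It assumes the exporter keeps the last
value for a gauge and a running total for a counter.

```python
class ToyExporter:
    """Hypothetical stand-in for an exporter that accumulates StatsD samples."""

    def __init__(self):
        self.gauges = {}
        self.counters = {}

    def gauge(self, name, value):
        # A gauge keeps only the last value it was sent.
        self.gauges[name] = value

    def incr(self, name, count=1):
        # A counter keeps a running total.
        self.counters[name] = self.counters.get(name, 0) + count

    def scrape(self, name):
        # What a scrape would see: (gauge value, counter total).
        return self.gauges.get(name), self.counters.get(name, 0)


def scheduler_loops(exporter, loops, name="scheduler_heartbeat"):
    # Emit the heartbeat both ways on every loop, for comparison only.
    for _ in range(loops):
        exporter.gauge(name, 1)
        exporter.incr(name)


exporter = ToyExporter()
scheduler_loops(exporter, 3)
first = exporter.scrape("scheduler_heartbeat")   # scheduler running
scheduler_loops(exporter, 2)
second = exporter.scrape("scheduler_heartbeat")  # scheduler still running
# The scheduler crashes here: no further samples arrive.
third = exporter.scrape("scheduler_heartbeat")

print(first, second, third)
```

In this model the gauge reads 1 at every scrape, including the one after
the crash, while the counter total stops growing between the second and
third scrape.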


> Change scheduler_heartbeat metric from gauge to counter
> -------------------------------------------------------
>
>                 Key: AIRFLOW-3177
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3177
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>    Affects Versions: 2.0.0
>            Reporter: Greg Neiheisel
>            Assignee: Greg Neiheisel
>            Priority: Minor
>             Fix For: 1.10.1
>
>
> Currently, the scheduler_heartbeat metric exposed with the statsd integration 
> is a gauge. I'm proposing to change the gauge to a counter for a better 
> integration with Prometheus via the 
> statsd_exporter (https://github.com/prometheus/statsd_exporter).
> Rather than pointing Airflow at an actual statsd server, you can point it at 
> this exporter, which will accumulate the metrics and expose them to be 
> scraped by Prometheus at /metrics. The problem is that once this value is set 
> when the scheduler runs its first loop, it will always be exposed to 
> Prometheus as 1. The scheduler can crash or be turned off, and the statsd 
> exporter will still report a 1 until the exporter is restarted and rebuilds 
> its internal state.
> By turning this metric into a counter, we can detect an issue with the 
> scheduler by graphing and alerting using a rate. If the rate of change of the 
> counter drops below its expected value (determined by the 
> scheduler_heartbeat_secs setting), we can fire an alert.
> This should be helpful for adoption in Kubernetes environments where 
> Prometheus is pretty much the standard.
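
The rate-based check described above might be sketched as follows. This
is an illustration under stated assumptions, not the alerting rule used
by any real deployment: it assumes the counter is incremented once per
heartbeat, so the expected rate is 1 / scheduler_heartbeat_secs, and the
function names and the tolerance parameter are hypothetical.

```python
def heartbeat_rate(samples):
    """Average per-second increase of the counter.

    samples is a list of (timestamp_seconds, counter_value), oldest first.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time window")
    return (v1 - v0) / (t1 - t0)


def scheduler_looks_unhealthy(samples, scheduler_heartbeat_secs, tolerance=0.5):
    # Assumption: one increment per heartbeat interval.
    expected_rate = 1.0 / scheduler_heartbeat_secs
    # Alert when the observed rate falls below a fraction of the expected rate.
    return heartbeat_rate(samples) < expected_rate * tolerance


# Scraped values over one minute with scheduler_heartbeat_secs = 5.
healthy = [(0, 0), (30, 6), (60, 12)]
stalled = [(0, 12), (30, 12), (60, 12)]

print(scheduler_looks_unhealthy(healthy, 5))
print(scheduler_looks_unhealthy(stalled, 5))
```

In Prometheus itself, the same idea would normally be expressed with a
rate over the counter and an alert threshold rather than hand-written
code.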



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
