Hi - I'm pretty new to Airflow, and I'm trying to get some visibility/observability into what Airflow is doing.
I would like to be able to observe dag runs and task instances, and to send metrics to time series databases (possibly by extending the existing Airflow metrics exported in statsd format). I would also like to be able to label these datapoints with things like the dag run id. For example: every time a task succeeded, was retried, or failed, it would send a datapoint whose value is the duration of the task, labeled with the dag run id. I could then plot these datapoints in a stacked bar graph, where each stack contains the durations of all tasks in a dag run and the total height is the run's total duration, and analyze whether my dag runs are getting faster or slower over time. In short, I want information from Airflow state to be usable as metric labels, and I want to be able to apply my own labels as well (perhaps via a callback that is executed at the time the metric is published and given context parameter(s)).

I believe I can do this today, but it requires quite a bit of work on my end, and every developer working on my team/cluster would need to be onboarded with a standard way of doing things; otherwise not everything running in the cluster will be observed the same way.

One option is adding callbacks to all my operators (see the first sketch below). There are some problems here. When on_success_callback runs, I know when the task started, but I have to infer the total duration by taking the difference between the task start time and "now" in the middle of the callback. It's also tedious and error-prone to make sure these callbacks are actually attached everywhere. I could use custom operators, which is slightly less tedious than augmenting every operator instance, but still not as elegant to me as the idea of Airflow offering task, operator, and/or dag middleware extension points.

An alternative approach would be for Airflow to fire webhooks corresponding to events like dag runs starting, ending, or being cleared, or tasks starting, being retried, etc. Think of GitHub firing webhooks when branches are updated or PRs are opened, updated, or closed, and CI/CD systems loosely coupled to those webhooks that run builds and deploy code. If you could configure Airflow to make a request to a webhook endpoint whenever something happened, the payload could include a bunch of relevant state. A listener could use that payload to build the metrics I need, and more; it could also make further requests to the Airflow API if necessary. (The second sketch below approximates this with the existing callback mechanism.)

I haven't really explored how the perspective on the problem changes if I wanted pull-style metrics instead of push-style. For example, to get the same graph I described above from Prometheus scraping Airflow, maybe I wouldn't need middleware or hooks at all (the third sketch below is a rough attempt). The trade-offs seem to be: if Airflow fires webhooks, I need a webhook listener to interpret them and get data into my monitoring system. If Airflow provides middleware extension points, I have to stick my custom code directly inside Airflow, which introduces coupling between Airflow and the monitoring system. If I provide a metrics endpoint for something like Prometheus to scrape, then the additional logic I would otherwise have put in my webhook listener has to plug into or monitor Prometheus instead.
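To make the callback idea concrete, here is a minimal sketch of what I mean. I'm assuming a DogStatsD-style client (the `datadog` package here), since plain statsd has no notion of labels/tags; the metric name and tag set are just illustrative. Note the duration has to be inferred from "now" because the task's end time isn't recorded yet when the callback fires.

from datetime import datetime, timezone

from datadog import statsd


def report_task_duration(context):
    ti = context["task_instance"]
    # The task's end time isn't persisted yet when this callback runs,
    # so infer the duration from "now", as described above.
    duration = (datetime.now(timezone.utc) - ti.start_date).total_seconds()
    statsd.histogram(
        "airflow.task.duration.seconds",  # illustrative metric name
        duration,
        tags=[
            f"dag_id:{ti.dag_id}",
            f"task_id:{ti.task_id}",
            f"run_id:{context['run_id']}",
        ],
    )


# This still has to be wired into every dag/operator, e.g. via default_args:
default_args = {"on_success_callback": report_task_duration}

The remaining problem is exactly the one above: making sure default_args like these are actually applied by everyone, everywhere in the cluster.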
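And here is roughly what I mean by webhooks. As far as I know Airflow doesn't fire these natively today, so this second sketch fakes the idea with the same callback mechanism; the endpoint URL and payload shape are made up.

import requests

WEBHOOK_URL = "https://observer.example.com/airflow-events"  # hypothetical listener


def fire_webhook(event):
    def callback(context):
        ti = context["task_instance"]
        # Payload shape is invented; a real listener would define its own contract.
        requests.post(
            WEBHOOK_URL,
            json={
                "event": event,
                "dag_id": ti.dag_id,
                "task_id": ti.task_id,
                "run_id": context["run_id"],
                "try_number": ti.try_number,
                "start_date": ti.start_date.isoformat(),
            },
            timeout=5,
        )
    return callback


default_args = {
    "on_success_callback": fire_webhook("task_success"),
    "on_failure_callback": fire_webhook("task_failure"),
    "on_retry_callback": fire_webhook("task_retry"),
}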
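For completeness, a third, pull-style sketch: a small sidecar that reads task durations out of the Airflow metadata database and exposes them for Prometheus to scrape. The model/session imports are what I'd expect to work against Airflow 1.10; the metric name, port, and label set are arbitrary choices on my part.

import time

from airflow.models import TaskInstance
from airflow.settings import Session
from prometheus_client import Gauge, start_http_server

TASK_DURATION = Gauge(
    "airflow_task_duration_seconds",
    "Duration of the most recent run of each task",
    ["dag_id", "task_id", "state"],
)


def collect():
    session = Session()
    try:
        # In practice you'd filter to recent runs instead of scanning everything.
        for ti in session.query(TaskInstance).filter(TaskInstance.duration.isnot(None)):
            TASK_DURATION.labels(ti.dag_id, ti.task_id, ti.state).set(ti.duration)
    finally:
        session.close()


if __name__ == "__main__":
    start_http_server(9102)  # becomes the Prometheus scrape target
    while True:
        collect()
        time.sleep(30)

One caveat I can already see with this route: labeling by dag run id (which the stacked bar graph needs) would be high-cardinality in Prometheus, so the per-run analysis would probably have to live somewhere else.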
I'm putting this out to the list for two reasons: (1) maybe someone can suggest a simple solution where I write a small amount of code in one place to glue things together, without depending on devs on my team always reaching for the correct custom operator when they're used to the standard ones; (2) if there is no satisfactory solution, what's the right way to evolve Airflow so that there is one?

For additional context, I asked about this on Slack already:
https://apache-airflow.slack.com/archives/CCR6P6JRL/p1551743461193800
https://apache-airflow.slack.com/archives/CCR6P6JRL/p1553171296597100

I have to admit I don't fully understand what @bosnjak was getting at here; maybe this is really the answer to all my problems:

>bosnjak [10 days ago]
>if you want to handle them differently depending on the task, you should have a single callback handler that can route the calls depending on the context, which is passed to the handler by default. You can access task instance and other stuff from there.
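If I'm reading that right, the idea is something like the following: attach one shared handler everywhere (e.g. via default_args) and let it route on whatever is in the context. The routing condition and the handlers it dispatches to are placeholders of my own invention.

def handle_etl_task(context):
    ...  # e.g. push a statsd datapoint, as in the first sketch above


def handle_default_task(context):
    ...  # e.g. fire a webhook, as in the second sketch above


def route_task_event(context):
    ti = context["task_instance"]
    if ti.dag_id.startswith("etl_"):  # hypothetical routing convention
        handle_etl_task(context)
    else:
        handle_default_task(context)


default_args = {
    "on_success_callback": route_task_event,
    "on_failure_callback": route_task_event,
}

That still leaves the problem of guaranteeing that every dag in the cluster actually uses these default_args, which is the gap I was hoping middleware or webhooks would close.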
