hi all, been lurking for a while - this is my first post. what I like about OpenTelemetry is that you can send all of your traces to stdout (or any log), which you can then pipe to the log forwarder of your choice. imo this is the easiest way to set it up, and a default that should work in the vast majority of Airflow use cases.
the PR looks like a great start! what can I do to help?

---
nick

> On Jan 7, 2022, at 14:37, Elad Kalif <[email protected]> wrote:
>
> Hi Howard,
>
> We actually have an Outreachy intern (Melodie) who is researching how
> OpenTelemetry can be integrated with Airflow.
> Draft PR for demo: https://github.com/apache/airflow/pull/20677
> This is an initial effort for a POC.
> Maybe you can work together on this?
>
>
> On Sat, Jan 8, 2022 at 12:19 AM Howard Yoo <[email protected]> wrote:
> Hi all,
>
> I’m a staff product manager at Astronomer, and wanted to post this email
> according to the guide at
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals
>
> Currently, the main method to publish telemetry data out of Airflow is
> through its StatsD implementation:
> https://github.com/apache/airflow/blob/main/airflow/stats.py
> Airflow currently supports two flavors of StatsD: the original one, and
> Datadog’s DogStatsD implementation.
>
> Through this implementation, the following list of metrics is available
> for other popular monitoring tools to collect, monitor, visualize, and
> alert on:
> https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
>
> There are a number of limitations in Airflow’s current StatsD-based
> metrics implementation:
> 1. StatsD is based on a simple metrics format that does not support
>    richer context. The metric name can carry some of that context (such
>    as dag id, task id, etc.), but this is limited because the context
>    has to be formatted as part of the metric name itself. A better
>    approach would be to attach ‘tags’ to the metrics data to add more
>    context.
> 2. StatsD uses UDP as its main network protocol, but UDP is simple and
>    does not guarantee reliable transmission of the payload. Moreover,
>    many monitoring tools are moving to more modern protocols such as
>    HTTPS to send out metrics.
> 3. StatsD supports ‘counter,’ ‘gauge,’ and ‘timer,’ but does not support
>    distributed traces or log ingestion.
>
> For the above reasons, I have been looking at OpenTelemetry
> (https://github.com/open-telemetry) as a potential replacement for
> Airflow’s current telemetry instrumentation. OpenTelemetry is a product
> of OpenTracing and OpenCensus, and is quickly gaining momentum as a
> ‘standard’ means of producing and delivering telemetry data: not only
> metrics, but distributed traces and logs as well. The technology is also
> geared towards better monitoring of cloud-native software. Many
> monitoring tool vendors support OpenTelemetry (Tanzu, Datadog,
> Honeycomb, Lightstep, etc.), and OpenTelemetry’s modular architecture is
> designed to be compatible with existing legacy instrumentation. There
> are also stable Python SDKs and APIs that make it easy to implement in
> Airflow.
>
> Therefore, I’d like to propose improving the metrics and telemetry
> capability of Airflow by adding configuration and support for
> OpenTelemetry. While maintaining backward compatibility with the
> existing StatsD-based metrics, we would also have the opportunity to
> base distributed traces and logs on it, which would make it easier for
> any OpenTelemetry-compatible tool to monitor Airflow with richer
> information.
>
> If you have been thinking about the need to improve the current metrics
> capabilities of Airflow, and about standards like OpenTelemetry, please
> feel free to join the thread and provide any opinions or feedback. I
> also think that we may need to review our current list of metrics and
> assess whether they are really useful for monitoring and observability
> of Airflow. There are things we might want to add, such as more
> executor-related metrics, scheduler-related metrics, as well as
> operator and even DB and XCom-related metrics, to better assess the
> health of Airflow and make this information helpful for faster
> troubleshooting and problem resolution.
>
> Thanks and regards,
> Howard
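
PS: on point 1 in Howard's list (context baked into the metric name vs. attached as tags), a rough illustration of how the two look on the wire. this is plain Python with no StatsD client involved; the two helper functions are hypothetical, the name-encoded metric follows the `dag.<dag_id>.<task_id>.duration` pattern, and `airflow.task.duration` is a made-up tagged name, not an existing Airflow metric. the `|#key:value` suffix is the DogStatsD tag extension.

```python
def statsd_line(name, value, metric_type):
    """Plain StatsD datagram: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def dogstatsd_line(name, value, metric_type, tags):
    """DogStatsD datagram: same as StatsD plus |#key:value,key:value tags."""
    tag_str = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    return f"{name}:{value}|{metric_type}|#{tag_str}"


# Context encoded in the metric name: every dag/task pair is a new metric.
plain = statsd_line("airflow.dag.example_dag.extract.duration", 1250, "ms")

# Context carried as tags: one metric name, filterable by dag_id and task_id.
tagged = dogstatsd_line(
    "airflow.task.duration",  # hypothetical metric name
    1250,
    "ms",
    {"dag_id": "example_dag", "task_id": "extract"},
)

print(plain)   # airflow.dag.example_dag.extract.duration:1250|ms
print(tagged)  # airflow.task.duration:1250|ms|#dag_id:example_dag,task_id:extract
```

with tags, a backend can group or filter on `dag_id` without having to parse it back out of the metric name.
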
