This sounds great. I left a small comment about the console-span processor. While I think Jaeger is a great choice for production dashboards, just printing the spans to STDOUT from the Airflow server would make a great PoC, imo, because it starts the discussion toward structured logging. By having the structured-logging discussion first, everything down the line (dashboards, metrics, SLOs, etc.) will be much easier.
FWIW, I'd like to see log ingestion services parse structured data and create the dashboards on my behalf from just this; as an end-user of tools like Datadog, I don't think I'd be asking for too much. If there's a working session on this, I'd love to join. Thanks all for this awesome project!
---
nick

> On Jan 9, 2022, at 08:35, Jarek Potiuk <[email protected]> wrote:
>
> Good news - I managed to debug and fix/work around the Flask
> auto-instrumentation, so Melodie should be unblocked.
>
> It was not an easy one to pull off - it required a bit of knowledge of
> how the Airflow webserver works under the hood, and finding out that
> gunicorn's fork model needs a workaround for the open-telemetry
> integration.
>
> This makes our open-telemetry instrumentation a bit more "complex"
> (but only a bit) and slightly more "exporter-specific" - currently we
> have a hard-coded Jaeger exporter - but we should be able to automate
> it better in the future. We might not even need any workarounds once
> https://github.com/open-telemetry/opentelemetry-python-contrib/issues/171
> (Add integration for Gunicorn) is implemented in the
> opentelemetry-python library (maybe we can even contribute it).
>
> You can see the changes I had to implement here:
> https://github.com/apache/airflow/pull/20677/commits/31f5fce3ff456dfffcd5814db9504094a0448ded
> and see this comment for the screenshots from Jaeger:
> https://github.com/apache/airflow/pull/20677#issuecomment-1008327884
>
> We are going to proceed now with further integration (hopefully with
> less trouble) of the other existing instrumentations.
>
> Howard, Nick,
>
> I think what might be helpful (and Howard's product-manager view might
> be super-helpful here) is to define the scope of the integration of
> the "Airflow-specific" telemetry.
> Defining the metrics that we would like to have (starting from the
> current set of metrics), and later proposing ways to test them and
> produce some basic dashboards with some of the monitoring tools we
> could choose - all at a "proof-of-concept" level, so that we can
> produce some real examples and screenshots of how the open-telemetry
> integration might work and what value it might bring.
>
> The end goal of Melody's internship is to prepare an Airflow
> Improvement Proposal where, based on our learnings from the
> internship, we propose how the integration should look.
>
> WDYT?
>
> J.
>
> On Sat, Jan 8, 2022 at 12:20 PM Jarek Potiuk <[email protected]> wrote:
>>
>> Yep, absolutely. The stage we are at now (and something I planned to
>> look at this weekend) is figuring out why auto-instrumentation of
>> open-telemetry in Melody's PR does not seem to auto-instrument our
>> Flask integration. We chose Flask as the first integration because it
>> should be "easy", but for whatever reason auto-instrumentation - even
>> in the `--debug` mode of airflow - does not seem to work, despite
>> everything seemingly being "correct".
>>
>> I plan to take a look at it today, and we can discuss it in Melody's
>> PR. It would be fantastic if we could work on it together :).
>>
>> J.
>>
>> On Sat, Jan 8, 2022 at 12:09 PM melodie ezeani <[email protected]> wrote:
>>>
>>> Hi Nick,
>>>
>>> You can look at the PR, or clone my fork and try running it in your
>>> local environment, and see if there's any way we can improve the
>>> auto-instrumentation. Would love to get feedback.
>>> Thank you
>>>
>>> On Sat, 8 Jan 2022 at 12:19 AM, <[email protected]> wrote:
>>>>
>>>> hi all, been lurking for a while - this is my first post.
>>>>
>>>> what I like about open telemetry is that you can send all telemetry
>>>> traces to STDOUT (or any logs), which you can then pipe to any log
>>>> forwarder of choice.
>>>> imo this is the easiest way to set it up, and a default that
>>>> should work in the vast majority of airflow use cases.
>>>>
>>>> the PR looks like a great start! what can I do to help?
>>>> ---
>>>> nick
>>>>
>>>> On Jan 7, 2022, at 14:37, Elad Kalif <[email protected]> wrote:
>>>>
>>>> Hi Howard,
>>>>
>>>> We actually have an Outreachy intern (Melodie) who is working on
>>>> researching how open-telemetry can be integrated with Airflow.
>>>> Draft PR for the demo: https://github.com/apache/airflow/pull/20677
>>>> This is an initial effort for a POC.
>>>> Maybe you can work together on this?
>>>>
>>>> On Sat, Jan 8, 2022 at 12:19 AM Howard Yoo
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm a staff product manager at Astronomer, and I wanted to post
>>>>> this email according to the guide at
>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals
>>>>>
>>>>> Currently, the main method of publishing telemetry data out of
>>>>> Airflow is its StatsD implementation:
>>>>> https://github.com/apache/airflow/blob/main/airflow/stats.py
>>>>> Airflow currently supports two flavors of StatsD: the original
>>>>> one, and Datadog's DogStatsD implementation.
>>>>>
>>>>> Through this implementation, the following list of metrics is
>>>>> available for popular monitoring tools to collect, monitor,
>>>>> visualize, and alert on:
>>>>> https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
>>>>>
>>>>> There are a number of limitations in Airflow's current
>>>>> StatsD-based metrics implementation:
>>>>> 1. StatsD is based on a simple metrics format that does not
>>>>> support richer context. A metric name can embed some of that
>>>>> context (such as dag id, task id, etc.), but this is limited by
>>>>> the constraint of having to be part of the metric name itself.
>>>>> A better approach would be to attach 'tags' to the metric data to
>>>>> add more context.
>>>>> 2. StatsD uses UDP as its main network protocol, but UDP is a
>>>>> simple protocol that does not guarantee reliable delivery of the
>>>>> payload. Moreover, many monitoring systems are moving to more
>>>>> modern transports, such as HTTPS, to send out metrics.
>>>>> 3. StatsD supports 'counter', 'gauge', and 'timer', but does not
>>>>> support distributed traces or log ingestion.
>>>>>
>>>>> For the above reasons, I have been looking at OpenTelemetry
>>>>> (https://github.com/open-telemetry) as a potential replacement
>>>>> for Airflow's current telemetry instrumentation. OpenTelemetry is
>>>>> the product of merging OpenTracing and OpenCensus, and it is
>>>>> quickly gaining momentum as a 'standard' way of producing and
>>>>> delivering telemetry data - not only metrics, but distributed
>>>>> traces and logs as well. The technology is also geared towards
>>>>> better monitoring of cloud-native software. Many monitoring-tool
>>>>> vendors support OpenTelemetry (Tanzu, Datadog, Honeycomb,
>>>>> Lightstep, etc.), and OpenTelemetry's modular architecture is
>>>>> designed to be compatible with existing legacy instrumentations.
>>>>> There are also stable Python SDKs and APIs that make it easy to
>>>>> adopt in Airflow.
>>>>>
>>>>> Therefore, I'd like to propose improving the metrics and
>>>>> telemetry capabilities of Airflow by adding configuration and
>>>>> support for OpenTelemetry, so that while maintaining backward
>>>>> compatibility with the existing StatsD-based metrics, we would
>>>>> also have the opportunity to base distributed traces and logs on
>>>>> it. This would make it easier for any OpenTelemetry-compatible
>>>>> tool to monitor Airflow with richer information.
>>>>>
>>>>> If you have been thinking about the need to improve the current
>>>>> metrics capabilities of Airflow, and have been considering
>>>>> standards like OpenTelemetry, please feel free to join the thread
>>>>> and provide opinions or feedback. I also generally think that we
>>>>> may need to review our current list of metrics and assess whether
>>>>> they are really useful for monitoring and observability of
>>>>> Airflow. There are things we might want to add, such as more
>>>>> executor-related and scheduler-related metrics, as well as
>>>>> operator and even DB- and XCom-related metrics, to better assess
>>>>> the health of Airflow and make this information helpful for
>>>>> faster troubleshooting and problem resolution.
>>>>>
>>>>> Thanks and regards,
>>>>> Howard
