This sounds great. I left a small comment about the console-span processor. While I think Jaeger is a great choice for production dashboards, just printing the spans to STDOUT from the Airflow server would make a great PoC, imo, because it starts the discussion toward structured logging. By having the structured-logging discussion first, everything down the line (dashboards, metrics, SLOs, etc.) will be much easier.
FWIW, I'd like to see log ingestion services parse structured data and create the dashboards on my behalf from just this; as an end-user of tools like Datadog, I don't think I'd be asking for too much. If there's a working session on this, I'd love to join. Thanks all for this awesome project!
---
nick

> On Jan 9, 2022, at 08:35, Jarek Potiuk <[email protected]> wrote:
>
> Good news - I managed to debug and fix/work around the Flask
> auto-instrumentation, so Melodie should be unblocked.
>
> It was not an easy one to pull off - it required a bit of knowledge of
> how the Airflow webserver works under the hood, and finding out that
> gunicorn's fork model needs a workaround for the open-telemetry
> integration.
>
> This makes our open-telemetry instrumentation a bit more "complex"
> (but only a bit) and slightly more "exporter-specific" - currently we
> have a hard-coded Jaeger exporter - but we should be able to automate
> it better in the future. We might not even need any workarounds once
> https://github.com/open-telemetry/opentelemetry-python-contrib/issues/171
> (Add integration for Gunicorn) is implemented in the
> opentelemetry-python library (maybe we can even contribute it).
>
> You can see the changes I had to implement here:
> https://github.com/apache/airflow/pull/20677/commits/31f5fce3ff456dfffcd5814db9504094a0448ded
> and see this comment for the screenshots from Jaeger:
> https://github.com/apache/airflow/pull/20677#issuecomment-1008327884
>
> We are going to proceed now with further integration (hopefully with
> less trouble) of the other existing instrumentations.
>
> Howard, Nick,
>
> I think what might be helpful (and Howard's product-manager view might
> be super-helpful here) is to define the scope of the integration of
> the "Airflow-specific" telemetry.
> Defining the metrics that we would like to have (starting from the
> current set of metrics), and later proposing ways to test them and
> produce some basic dashboards with some of the monitoring tools we
> could choose - all at a "proof-of-concept" level, so that we can
> produce some real examples and screenshots of how the open-telemetry
> integration might work and what value it might bring.
>
> The end goal of Melody's internship is to prepare an Airflow
> Improvement Proposal where, based on our learnings from the
> internship, we propose how the integration should look.
>
> WDYT?
>
> J.
>
> On Sat, Jan 8, 2022 at 12:20 PM Jarek Potiuk <[email protected]> wrote:
>>
>> Yep, absolutely. The stage we are at now (and something I planned to
>> look at this weekend) is figuring out why auto-instrumentation of
>> open-telemetry in Melody's PR does not seem to auto-instrument our
>> Flask integration. We chose Flask as the first integration because it
>> should be "easy", but for whatever reason auto-instrumentation - even
>> in the `--debug` mode of airflow - does not seem to work, despite
>> everything seemingly being "correct".
>>
>> I plan to take a look at it today, and we can discuss it in Melody's
>> PR. It would be fantastic if we could work on it together :).
>>
>> J.
>>
>> On Sat, Jan 8, 2022 at 12:09 PM melodie ezeani <[email protected]> wrote:
>>>
>>> Hi Nick,
>>>
>>> You can look at the PR, or clone my fork and try running it in your
>>> local environment, and see if there's any way we can improve the
>>> auto-instrumentation. Would love to get feedback.
>>> Thank you
>>>
>>> On Sat, 8 Jan 2022 at 12:19 AM, <[email protected]> wrote:
>>>>
>>>> hi all, been lurking for a while - this is my first post.
>>>>
>>>> what I like about open telemetry is that you can send all telemetry
>>>> traces to STDOUT (or any logs), which you can then pipe to any log
>>>> forwarder of choice.
>>>> imo this is the easiest way to set it up, and a default that
>>>> should work in the vast majority of airflow use cases.
>>>>
>>>> the PR looks like a great start! what can I do to help?
>>>> ---
>>>> nick
>>>>
>>>> On Jan 7, 2022, at 14:37, Elad Kalif <[email protected]> wrote:
>>>>
>>>> Hi Howard,
>>>>
>>>> We actually have an Outreachy intern (Melodie) who is working on
>>>> researching how open-telemetry can be integrated with Airflow.
>>>> Draft PR for the demo: https://github.com/apache/airflow/pull/20677
>>>> This is an initial effort for a POC.
>>>> Maybe you can work together on this?
>>>>
>>>> On Sat, Jan 8, 2022 at 12:19 AM Howard Yoo
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm a staff product manager at Astronomer, and I wanted to post
>>>>> this email according to the guide at
>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals
>>>>>
>>>>> Currently, the main method of publishing telemetry data out of
>>>>> Airflow is its StatsD implementation:
>>>>> https://github.com/apache/airflow/blob/main/airflow/stats.py
>>>>> Airflow currently supports two flavors of StatsD: the original
>>>>> one, and Datadog's DogStatsD implementation.
>>>>>
>>>>> Through this implementation, the following list of metrics is
>>>>> available for popular monitoring tools to collect, monitor,
>>>>> visualize, and alert on:
>>>>> https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
>>>>>
>>>>> There are a number of limitations in Airflow's current
>>>>> StatsD-based metrics implementation:
>>>>> 1. StatsD is based on a simple metrics format that does not
>>>>> support richer context. A metric name can embed some of that
>>>>> context (such as dag id, task id, etc.), but this is limited by
>>>>> the constraint of having to be part of the metric name itself.
>>>>> A better approach would be to attach 'tags' to the metric data to
>>>>> add more context.
>>>>> 2. StatsD uses UDP as its main network protocol, but UDP is a
>>>>> simple protocol that does not guarantee reliable delivery of the
>>>>> payload. Moreover, many monitoring systems are moving to more
>>>>> modern transports, such as HTTPS, to send out metrics.
>>>>> 3. StatsD supports 'counter', 'gauge', and 'timer', but does not
>>>>> support distributed traces or log ingestion.
>>>>>
>>>>> For the above reasons, I have been looking at OpenTelemetry
>>>>> (https://github.com/open-telemetry) as a potential replacement
>>>>> for Airflow's current telemetry instrumentation. OpenTelemetry is
>>>>> the product of merging OpenTracing and OpenCensus, and it is
>>>>> quickly gaining momentum as a 'standard' way of producing and
>>>>> delivering telemetry data - not only metrics, but distributed
>>>>> traces and logs as well. The technology is also geared towards
>>>>> better monitoring of cloud-native software. Many monitoring-tool
>>>>> vendors support OpenTelemetry (Tanzu, Datadog, Honeycomb,
>>>>> Lightstep, etc.), and OpenTelemetry's modular architecture is
>>>>> designed to be compatible with existing legacy instrumentations.
>>>>> There are also stable Python SDKs and APIs that make it easy to
>>>>> adopt in Airflow.
>>>>>
>>>>> Therefore, I'd like to propose improving the metrics and
>>>>> telemetry capabilities of Airflow by adding configuration and
>>>>> support for OpenTelemetry, so that while maintaining backward
>>>>> compatibility with the existing StatsD-based metrics, we would
>>>>> also have the opportunity to base distributed traces and logs on
>>>>> it. This would make it easier for any OpenTelemetry-compatible
>>>>> tool to monitor Airflow with richer information.
>>>>>
>>>>> If you have been thinking about the need to improve the current
>>>>> metrics capabilities of Airflow, and have been considering
>>>>> standards like OpenTelemetry, please feel free to join the thread
>>>>> and provide opinions or feedback. I also generally think that we
>>>>> may need to review our current list of metrics and assess whether
>>>>> they are really useful for monitoring and observability of
>>>>> Airflow. There are things we might want to add, such as more
>>>>> executor-related and scheduler-related metrics, as well as
>>>>> operator and even DB- and XCom-related metrics, to better assess
>>>>> the health of Airflow and make this information helpful for
>>>>> faster troubleshooting and problem resolution.
>>>>>
>>>>> Thanks and regards,
>>>>> Howard
