Finally had some time to go over the changes and wrote summary of it in AIP for better understanding, and updated the proposal doc: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow#AIP49OpenTelemetrySupportforApacheAirflow-CodeChangesSummary
Hope this helps! Howard On Fri, Apr 29, 2022 at 11:49 AM Howard Yoo <howard...@gmail.com> wrote: > Got it. Will describe the changes that might be subjected to be done - > however, please note that the changes were the things that I did on my own > during the POC, so the final changes may obviously be different. > > On Fri, Apr 29, 2022 at 3:57 AM Elad Kalif <elad...@apache.org> wrote: > >> I'd prefer that the Code Changes section would be more detailed. This is >> a very important part. >> I'm not sure referring to a private repo is the way to do it but even if >> you should at least list the components which are subject to changes. >> You don't need to specify every change you are going to do - only the >> critical ones where it couples with airflow core. >> You have changes for dag_proccessing, executors, scheduler_job and even >> migration to add columns. >> >> https://github.com/howardyoo/airflow/commit/cc83c2b377ac22f0e7ef82e7f59784df972037fd >> These are important >> >> >> On Wed, Apr 27, 2022 at 10:09 PM Howard Yoo <howard...@gmail.com> wrote: >> >>> Hi Elad, I have updated the AIP-49 with the appropriate changes to >>> contain what you requested in your previous comments. I think the AIP is >>> fairly comprehensive and ready to be voted, unless there is any objections. >>> >>> Sincerely, >>> Howard >>> >>> On Mon, Apr 25, 2022 at 11:03 PM Howard Yoo <howard...@gmail.com> wrote: >>> >>>> Hi Elad, >>>> Will take a look at it and let you know! >>>> >>>> Thanks, >>>> Howard >>>> >>>> On Mon, Apr 25, 2022 at 6:35 AM Elad Kalif <elad...@apache.org> wrote: >>>> >>>>> Hi Howard, >>>>> Have you made progress with addressing the comments? >>>>> I think once points are addressed we can maybe start a vote? >>>>> >>>>> On Tue, Apr 12, 2022 at 12:39 PM Jarek Potiuk <ja...@potiuk.com> >>>>> wrote: >>>>> >>>>>> > 1. I would assume the path for StatsD (dogstatsd) would be for >>>>>> deprecation - we will perhaps comment or mark it as deprecation - and >>>>>> should follow the established (or usual?) process of feature deprecation >>>>>> of >>>>>> Airflow, once the opentelemetry is in place. >>>>>> >>>>>> Yep. Deprecation should be. And maybe accompanied with a "statsd >>>>>> open-telemetry exporter" ? I think eventually we should have just OTEL >>>>>> stats and nothing else. The current "configurable" metrics class >>>>>> should be fully replaced with "OTEL configurable" metrics. Same as >>>>>> (but that's a much longer and uncertain path) logging configuration >>>>>> should get replaced if/when OTEL logging lives to its promises. >>>>>> >>>>>> > 2. Airflow will provide its own list of 'instrumented metrics' out >>>>>> of the box - with the list specified in its documentation so that users >>>>>> would be aware of them. However, with opentelemetry, the users will also >>>>>> have the ability to add their 'custom' metrics / traces / logs to be >>>>>> collected via opentelemetry as needed. That option and how-to's should >>>>>> also >>>>>> be documented. >>>>>> >>>>>> Good point. Actually the nice thing is that adding new metrics could >>>>>> then be done as part of task execution for example - so we should >>>>>> indeed have some libraries/tools or maybe even operator's /task >>>>>> interface should have some built-in capabilities and allow for >>>>>> declarative ways of adding metrics (but this can be added as a >>>>>> follow-up - it does not need to be described and hashed out yet IMHO - >>>>>> but we can add it to the docs as "future improvement") . >>>>>> >>>>>> > 3. A little clarification - I believe there were two POC's - which >>>>>> at the time during the first POC, it had lacked some of the features >>>>>> (e.g. >>>>>> traces and logs). However, in the second POC, there is a stable release >>>>>> of >>>>>> Traces, and beta release of logs, so things have been progressing. >>>>>> Opentelemetry is highly evolving and many things do change and get added >>>>>> relatively quickly. During my second POC (the one mentioned in the >>>>>> attached >>>>>> PDF) I was able to validate that all of the 'key' features that we needed >>>>>> to implement opentelemetry for airflow has been released and available. >>>>>> >>>>>> Yeah. I concur with that Howard wrote. What was there a few months >>>>>> ago a little "shaky", becomes more and more solid as time passes. And >>>>>> I also spoke with a few people who are involved in OpenTelemetry >>>>>> standard and development (one of my friends is site-lead for Sumo >>>>>> Logic Poland and they take active part in that effort and I just spoke >>>>>> to him about it). Also AWS is very much vested in it as far as I >>>>>> understand - I also spoke to Google and they are very much supporting >>>>>> OTEL as industry standard. I think there is a firm industry backing >>>>>> behind OTEL and there is no way it "won't progress" or "falter". This >>>>>> is a little bit of a "leap of faith" that it will become fully >>>>>> featured for our needs, but I think what is already there is "enough" >>>>>> to justify the move and anything that comes out of Beta is a bonus. >>>>>> Eventually if things will not go fast enough for us - we can also >>>>>> actually contribute there on the "collection" level. It will likely >>>>>> have a much better outcome than if we try to integrate airflow with >>>>>> multiple services ourselves - we will just have to make sure things >>>>>> get collected "well" - and then all the different services will more >>>>>> likely than not write exporters for it. >>>>>> >>>>>> > 4. This is a debatable topic - and I believe those may not be a >>>>>> part of the core airflow code base, but would suggest perhaps we could >>>>>> create a 'contribution' repo which may maintain those third party assets >>>>>> (e.g. Grafana dashboard, Datadog dashboard, etc.) that users may use and >>>>>> even participate in maintaining it. >>>>>> >>>>>> Yeah. I think this is also an opportunity for someone who is >>>>>> specializing in those or even let Grafana add it to their "portfolio". >>>>>> I can imagine there might be some basic dashboards generated by >>>>>> companies which do some kinds of integrations (and with an option of >>>>>> "come to us when you want more customizability". While we won't be >>>>>> able to endorse those, we can easily let it land in the "ecosystem" >>>>>> page of ours. I think also we can easily reach out (for example to >>>>>> Grafana to do it - in the case of Grafana, Myrle Krantz, ex Treasurer >>>>>> from ASF who I know well is a Senior Manager in Grafana responsible >>>>>> for Cloud team). >>>>>> >>>>>> >>>>>> > 5. POC may not have all the changes, as the work was to 'prototype' >>>>>> and answer questions like 'what if' when opentelemetry is in place. The >>>>>> proposal actually has link to a GIT repo that contains the changes that >>>>>> were done during the POC: >>>>>> https://github.com/howardyoo/airflow/tree/opentelemetry-poc-1 . >>>>>> Since the details of the changes would make the existing PDF even more >>>>>> subjected to TL;DR, I linked this git branch for anyone interested in the >>>>>> changes to take a look. I hope this would be sufficient. >>>>>> >>>>>> I think Elad is right that some more details need to "surface" from >>>>>> the POC to AIP. While the changes are very little - they don't change >>>>>> any flows or logic in Airflow, it's more to "selectively" add >>>>>> collections and make sure common "Span" id is shared throughout the >>>>>> code (which I think is the biggest change). >>>>>> I think Howard, it might make sense to extract parts of the POC and >>>>>> put them as an outline of changes to implement in the "AIP". >>>>>> >>>>>> The POC is more trace of what you've done, but I think simply >>>>>> translating this into "this is what we need to do" for those who will >>>>>> just read AIP is important. Our AIP's are much more "Technical" in >>>>>> nature than most of the >>>>>> >>>>>> > >>>>>> > 6. Yes, I agree - I will update the AIP proposal to make the scope >>>>>> more clearer. Thanks for the feedback! >>>>>> >>>>>> Yeah. Adding "future improvements" and especially "This is what we are >>>>>> not going to do and leave for later" is important part of every AIP. I >>>>>> think it's sometimes much more important to state what we are NOT >>>>>> going to vs. what we are going to do. >>>>>> >>>>>> > >>>>>> > Howard >>>>>> > >>>>>> > On Sun, Apr 3, 2022 at 1:27 PM Elad Kalif <elad...@apache.org> >>>>>> wrote: >>>>>> >> >>>>>> >> Thanks Howard! >>>>>> >> Sorry for the delay, this was a long read. The PDF alone is 19 >>>>>> pages :) >>>>>> >> looks very good! >>>>>> >> >>>>>> >> I have 6 questions/points to raise: >>>>>> >> >>>>>> >> 1. I'm not clear about what is to happen with StatsD . >>>>>> >> It states "Make OpenTelemetry and StatsD optional and >>>>>> interchangeable." >>>>>> >> But do we want to support both in the long run? >>>>>> >> It doesn't specify if we are deprecating statsD along with >>>>>> completion of this AIP. >>>>>> >> >>>>>> >> 2. regarding adding metrics. >>>>>> >> Do we intend to let users define their own KPIs/metrics to be >>>>>> measured or it will be a closed list set by Airflow? >>>>>> >> >>>>>> >> 3. The POC specifies it uses a feature of open-telemetry (add >>>>>> metrics) which was not yet released. Do we know the timeline for the >>>>>> feature to be released? >>>>>> >> Can we vote on a plan to use something which is not yet publicly >>>>>> available? >>>>>> >> >>>>>> >> 4. Are the Grafana dashboards / other dashboards to be part of the >>>>>> Airflow core code base? >>>>>> >> I wonder if this should be in a dedicated repo? >>>>>> >> >>>>>> >> 5. Could you please clarify in the AIP page what changes are >>>>>> required to the Airflow code base? >>>>>> >> Some of them appear in the PDF but I'm not sure if that is all of >>>>>> them? >>>>>> >> >>>>>> >> 6. The PDF has several open questions/ideas. >>>>>> >> I think it would be best to add to the AIP a scope paragraph >>>>>> listing what will be included in the first phase and what is left for >>>>>> other >>>>>> phases. >>>>>> >> >>>>>> >> On Fri, Apr 1, 2022 at 4:26 PM Jarek Potiuk <ja...@potiuk.com> >>>>>> wrote: >>>>>> >>> >>>>>> >>> Or maybe that people are so stunned by the beauty and usefulness >>>>>> of it >>>>>> >>> that they cannot even say a word :) >>>>>> >>> >>>>>> >>> On Thu, Mar 31, 2022 at 11:37 PM Howard Yoo <howard...@gmail.com> >>>>>> wrote: >>>>>> >>> > >>>>>> >>> > I think it may mean all is well, perhaps :-) >>>>>> >>> > Howard >>>>>> >>> > >>>>>> >>> > On Thu, Mar 31, 2022 at 9:29 AM Jarek Potiuk <ja...@potiuk.com> >>>>>> wrote: >>>>>> >>> >> >>>>>> >>> >> Hello everyone, >>>>>> >>> >> >>>>>> >>> >> Would be great to get some comments and reviews - especially >>>>>> from >>>>>> >>> >> those who are users and are doing monitoring. Howard made a >>>>>> lot of >>>>>> >>> >> effort to show examples of how OTEL might help in this. >>>>>> >>> >> >>>>>> >>> >> Otherwise, is the silence sign that all is good ? >>>>>> >>> >> >>>>>> >>> >> J. >>>>>> >>> >> >>>>>> >>> >> On Fri, Mar 25, 2022 at 9:53 PM Jarek Potiuk <ja...@potiuk.com> >>>>>> wrote: >>>>>> >>> >> > >>>>>> >>> >> > And let me add to it - we laid some foundations for it with >>>>>> Melodie >>>>>> >>> >> > Ezeani - the Outreachy intern where we did some internal >>>>>> integration >>>>>> >>> >> > work that let us understand the challenges and state of >>>>>> >>> >> > open-telemetry. >>>>>> >>> >> > >>>>>> >>> >> > I am super excited about what open-telemetry can bring to >>>>>> Airflow - >>>>>> >>> >> > both short term (in metrics and instrumentation) and longer >>>>>> term - in >>>>>> >>> >> > logging when logging is mature enough. >>>>>> >>> >> > >>>>>> >>> >> > The open-telemetry mov is at the heart of the principles we >>>>>> have - >>>>>> >>> >> > making Airflow "modern" but at the same time delegating >>>>>> what's not >>>>>> >>> >> > "core" to those who can do it better. Open Telemetry is >>>>>> precisely >>>>>> >>> >> > about that, >>>>>> >>> >> > >>>>>> >>> >> > Howard particularly brought a great experience from using >>>>>> Open >>>>>> >>> >> > Telemetry before and making some good judgements and POC on >>>>>> the >>>>>> >>> >> > metrics produced by Airflow and how they can be useful from >>>>>> the >>>>>> >>> >> > "maintainer value" side. I looked at it from the technical >>>>>> integration >>>>>> >>> >> > POV - and Melodie helped to validate some of the assumptions >>>>>> and >>>>>> >>> >> > expose some of the technical challenges. >>>>>> >>> >> > The composite result is good, but we are looking with Howard >>>>>> on some >>>>>> >>> >> > insightful comments and critique - especially from the users >>>>>> of >>>>>> >>> >> > Airflow! >>>>>> >>> >> > >>>>>> >>> >> > I look forward to more cool stuff on Airflow! >>>>>> >>> >> > >>>>>> >>> >> > J. >>>>>> >>> >> > >>>>>> >>> >> > On Fri, Mar 25, 2022 at 9:39 PM Howard Yoo < >>>>>> howard...@gmail.com> wrote: >>>>>> >>> >> > > >>>>>> >>> >> > > Hi all, >>>>>> >>> >> > > >>>>>> >>> >> > > I am pleased to announce the start of the discussion for >>>>>> the new AIP draft that was recently been published: >>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow >>>>>> >>> >> > > >>>>>> >>> >> > > Jarek Potuik and I have been discussing about this >>>>>> proposal since early this year. During that time, we worked together on >>>>>> drafting this proposal, as well as doing another round of mini-POC to >>>>>> refresh and test on feasibility of OpenTelemetry on Apache Airflow. >>>>>> >>> >> > > >>>>>> >>> >> > > As the POC was successful in terms of testing the >>>>>> OpenTelemetry on Airflow, we would like to expand the discussion to a >>>>>> wider >>>>>> user community here this mailing list to gather more consensus, comments, >>>>>> and feedbacks regarding this AIP. >>>>>> >>> >> > > >>>>>> >>> >> > > Sincerely, >>>>>> >>> >> > > Howard and Jarek >>>>>> >>>>>