Hi Elad, I have updated the AIP-49 with the appropriate changes to contain what you requested in your previous comments. I think the AIP is fairly comprehensive and ready to be voted, unless there is any objections.
Sincerely, Howard On Mon, Apr 25, 2022 at 11:03 PM Howard Yoo <[email protected]> wrote: > Hi Elad, > Will take a look at it and let you know! > > Thanks, > Howard > > On Mon, Apr 25, 2022 at 6:35 AM Elad Kalif <[email protected]> wrote: > >> Hi Howard, >> Have you made progress with addressing the comments? >> I think once points are addressed we can maybe start a vote? >> >> On Tue, Apr 12, 2022 at 12:39 PM Jarek Potiuk <[email protected]> wrote: >> >>> > 1. I would assume the path for StatsD (dogstatsd) would be for >>> deprecation - we will perhaps comment or mark it as deprecation - and >>> should follow the established (or usual?) process of feature deprecation of >>> Airflow, once the opentelemetry is in place. >>> >>> Yep. Deprecation should be. And maybe accompanied with a "statsd >>> open-telemetry exporter" ? I think eventually we should have just OTEL >>> stats and nothing else. The current "configurable" metrics class >>> should be fully replaced with "OTEL configurable" metrics. Same as >>> (but that's a much longer and uncertain path) logging configuration >>> should get replaced if/when OTEL logging lives to its promises. >>> >>> > 2. Airflow will provide its own list of 'instrumented metrics' out of >>> the box - with the list specified in its documentation so that users would >>> be aware of them. However, with opentelemetry, the users will also have the >>> ability to add their 'custom' metrics / traces / logs to be collected via >>> opentelemetry as needed. That option and how-to's should also be documented. >>> >>> Good point. Actually the nice thing is that adding new metrics could >>> then be done as part of task execution for example - so we should >>> indeed have some libraries/tools or maybe even operator's /task >>> interface should have some built-in capabilities and allow for >>> declarative ways of adding metrics (but this can be added as a >>> follow-up - it does not need to be described and hashed out yet IMHO - >>> but we can add it to the docs as "future improvement") . >>> >>> > 3. A little clarification - I believe there were two POC's - which at >>> the time during the first POC, it had lacked some of the features (e.g. >>> traces and logs). However, in the second POC, there is a stable release of >>> Traces, and beta release of logs, so things have been progressing. >>> Opentelemetry is highly evolving and many things do change and get added >>> relatively quickly. During my second POC (the one mentioned in the attached >>> PDF) I was able to validate that all of the 'key' features that we needed >>> to implement opentelemetry for airflow has been released and available. >>> >>> Yeah. I concur with that Howard wrote. What was there a few months >>> ago a little "shaky", becomes more and more solid as time passes. And >>> I also spoke with a few people who are involved in OpenTelemetry >>> standard and development (one of my friends is site-lead for Sumo >>> Logic Poland and they take active part in that effort and I just spoke >>> to him about it). Also AWS is very much vested in it as far as I >>> understand - I also spoke to Google and they are very much supporting >>> OTEL as industry standard. I think there is a firm industry backing >>> behind OTEL and there is no way it "won't progress" or "falter". This >>> is a little bit of a "leap of faith" that it will become fully >>> featured for our needs, but I think what is already there is "enough" >>> to justify the move and anything that comes out of Beta is a bonus. >>> Eventually if things will not go fast enough for us - we can also >>> actually contribute there on the "collection" level. It will likely >>> have a much better outcome than if we try to integrate airflow with >>> multiple services ourselves - we will just have to make sure things >>> get collected "well" - and then all the different services will more >>> likely than not write exporters for it. >>> >>> > 4. This is a debatable topic - and I believe those may not be a part >>> of the core airflow code base, but would suggest perhaps we could create a >>> 'contribution' repo which may maintain those third party assets (e.g. >>> Grafana dashboard, Datadog dashboard, etc.) that users may use and even >>> participate in maintaining it. >>> >>> Yeah. I think this is also an opportunity for someone who is >>> specializing in those or even let Grafana add it to their "portfolio". >>> I can imagine there might be some basic dashboards generated by >>> companies which do some kinds of integrations (and with an option of >>> "come to us when you want more customizability". While we won't be >>> able to endorse those, we can easily let it land in the "ecosystem" >>> page of ours. I think also we can easily reach out (for example to >>> Grafana to do it - in the case of Grafana, Myrle Krantz, ex Treasurer >>> from ASF who I know well is a Senior Manager in Grafana responsible >>> for Cloud team). >>> >>> >>> > 5. POC may not have all the changes, as the work was to 'prototype' >>> and answer questions like 'what if' when opentelemetry is in place. The >>> proposal actually has link to a GIT repo that contains the changes that >>> were done during the POC: >>> https://github.com/howardyoo/airflow/tree/opentelemetry-poc-1 . Since >>> the details of the changes would make the existing PDF even more subjected >>> to TL;DR, I linked this git branch for anyone interested in the changes to >>> take a look. I hope this would be sufficient. >>> >>> I think Elad is right that some more details need to "surface" from >>> the POC to AIP. While the changes are very little - they don't change >>> any flows or logic in Airflow, it's more to "selectively" add >>> collections and make sure common "Span" id is shared throughout the >>> code (which I think is the biggest change). >>> I think Howard, it might make sense to extract parts of the POC and >>> put them as an outline of changes to implement in the "AIP". >>> >>> The POC is more trace of what you've done, but I think simply >>> translating this into "this is what we need to do" for those who will >>> just read AIP is important. Our AIP's are much more "Technical" in >>> nature than most of the >>> >>> > >>> > 6. Yes, I agree - I will update the AIP proposal to make the scope >>> more clearer. Thanks for the feedback! >>> >>> Yeah. Adding "future improvements" and especially "This is what we are >>> not going to do and leave for later" is important part of every AIP. I >>> think it's sometimes much more important to state what we are NOT >>> going to vs. what we are going to do. >>> >>> > >>> > Howard >>> > >>> > On Sun, Apr 3, 2022 at 1:27 PM Elad Kalif <[email protected]> wrote: >>> >> >>> >> Thanks Howard! >>> >> Sorry for the delay, this was a long read. The PDF alone is 19 pages >>> :) >>> >> looks very good! >>> >> >>> >> I have 6 questions/points to raise: >>> >> >>> >> 1. I'm not clear about what is to happen with StatsD . >>> >> It states "Make OpenTelemetry and StatsD optional and >>> interchangeable." >>> >> But do we want to support both in the long run? >>> >> It doesn't specify if we are deprecating statsD along with completion >>> of this AIP. >>> >> >>> >> 2. regarding adding metrics. >>> >> Do we intend to let users define their own KPIs/metrics to be >>> measured or it will be a closed list set by Airflow? >>> >> >>> >> 3. The POC specifies it uses a feature of open-telemetry (add >>> metrics) which was not yet released. Do we know the timeline for the >>> feature to be released? >>> >> Can we vote on a plan to use something which is not yet publicly >>> available? >>> >> >>> >> 4. Are the Grafana dashboards / other dashboards to be part of the >>> Airflow core code base? >>> >> I wonder if this should be in a dedicated repo? >>> >> >>> >> 5. Could you please clarify in the AIP page what changes are required >>> to the Airflow code base? >>> >> Some of them appear in the PDF but I'm not sure if that is all of >>> them? >>> >> >>> >> 6. The PDF has several open questions/ideas. >>> >> I think it would be best to add to the AIP a scope paragraph listing >>> what will be included in the first phase and what is left for other phases. >>> >> >>> >> On Fri, Apr 1, 2022 at 4:26 PM Jarek Potiuk <[email protected]> wrote: >>> >>> >>> >>> Or maybe that people are so stunned by the beauty and usefulness of >>> it >>> >>> that they cannot even say a word :) >>> >>> >>> >>> On Thu, Mar 31, 2022 at 11:37 PM Howard Yoo <[email protected]> >>> wrote: >>> >>> > >>> >>> > I think it may mean all is well, perhaps :-) >>> >>> > Howard >>> >>> > >>> >>> > On Thu, Mar 31, 2022 at 9:29 AM Jarek Potiuk <[email protected]> >>> wrote: >>> >>> >> >>> >>> >> Hello everyone, >>> >>> >> >>> >>> >> Would be great to get some comments and reviews - especially from >>> >>> >> those who are users and are doing monitoring. Howard made a lot of >>> >>> >> effort to show examples of how OTEL might help in this. >>> >>> >> >>> >>> >> Otherwise, is the silence sign that all is good ? >>> >>> >> >>> >>> >> J. >>> >>> >> >>> >>> >> On Fri, Mar 25, 2022 at 9:53 PM Jarek Potiuk <[email protected]> >>> wrote: >>> >>> >> > >>> >>> >> > And let me add to it - we laid some foundations for it with >>> Melodie >>> >>> >> > Ezeani - the Outreachy intern where we did some internal >>> integration >>> >>> >> > work that let us understand the challenges and state of >>> >>> >> > open-telemetry. >>> >>> >> > >>> >>> >> > I am super excited about what open-telemetry can bring to >>> Airflow - >>> >>> >> > both short term (in metrics and instrumentation) and longer >>> term - in >>> >>> >> > logging when logging is mature enough. >>> >>> >> > >>> >>> >> > The open-telemetry mov is at the heart of the principles we >>> have - >>> >>> >> > making Airflow "modern" but at the same time delegating what's >>> not >>> >>> >> > "core" to those who can do it better. Open Telemetry is >>> precisely >>> >>> >> > about that, >>> >>> >> > >>> >>> >> > Howard particularly brought a great experience from using Open >>> >>> >> > Telemetry before and making some good judgements and POC on the >>> >>> >> > metrics produced by Airflow and how they can be useful from the >>> >>> >> > "maintainer value" side. I looked at it from the technical >>> integration >>> >>> >> > POV - and Melodie helped to validate some of the assumptions and >>> >>> >> > expose some of the technical challenges. >>> >>> >> > The composite result is good, but we are looking with Howard on >>> some >>> >>> >> > insightful comments and critique - especially from the users of >>> >>> >> > Airflow! >>> >>> >> > >>> >>> >> > I look forward to more cool stuff on Airflow! >>> >>> >> > >>> >>> >> > J. >>> >>> >> > >>> >>> >> > On Fri, Mar 25, 2022 at 9:39 PM Howard Yoo <[email protected]> >>> wrote: >>> >>> >> > > >>> >>> >> > > Hi all, >>> >>> >> > > >>> >>> >> > > I am pleased to announce the start of the discussion for the >>> new AIP draft that was recently been published: >>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow >>> >>> >> > > >>> >>> >> > > Jarek Potuik and I have been discussing about this proposal >>> since early this year. During that time, we worked together on drafting >>> this proposal, as well as doing another round of mini-POC to refresh and >>> test on feasibility of OpenTelemetry on Apache Airflow. >>> >>> >> > > >>> >>> >> > > As the POC was successful in terms of testing the >>> OpenTelemetry on Airflow, we would like to expand the discussion to a wider >>> user community here this mailing list to gather more consensus, comments, >>> and feedbacks regarding this AIP. >>> >>> >> > > >>> >>> >> > > Sincerely, >>> >>> >> > > Howard and Jarek >>> >>
