Hi all, Howard here. Any more comments related to adding OpenTelemetry support for Airflow? The feature is not just about distributed traces, but about metrics and logs as well. I know this thread is kind of old now, but I just wanted to check whether anybody has objections to going with OpenTelemetry.
Howard

On 2022/01/12 03:21:27 Howard Yoo wrote:
> I am very much interested in how we can improve not only the
> instrumentation by using OpenTelemetry, but also think about how we can
> make the existing metrics list better.
>
> For example, perhaps in the future we could provide things like how much
> CPU, memory, and disk I/O a task instance is using, by utilizing Python's
> psutil package, as mentioned here:
> https://stackoverflow.com/questions/16326529/python-get-process-names-cpu-mem-usage-and-peak-mem-usage-in-windows
> This works because local task jobs are essentially subprocesses. By
> utilizing OpenTelemetry, we could even collect host metrics and platform
> metrics that are outside the boundary of Airflow more easily - and even
> have them collected by the OTEL collector agent at the same time.
>
> I would be very happy if this internship project could also include
> collecting metrics in addition to the tracing, and think about how it can
> be extended to cover more than what's provided out of the box.
>
> - Howard
>
> On 2022/01/10 21:22:51 Jarek Potiuk wrote:
> > > Also, I do have feedback that the current metrics list and what they
> > > track are not really that useful
> >
> > Fully agree.
> >
> > > (I mean, there is so much that one can do for metrics like operator
> > > failures and TI failures - since they don't post any context-specific
> > > information) - so while we may be working on making OpenTelemetry
> > > available for Airflow, we might also investigate and try improvements
> > > by reviewing these metrics and really verifying whether they are
> > > helpful, and whether there are additional metrics we can instrument
> > > while doing this.
> >
> > Oh yeah.
> > > I think when we are designing for the distributed traces on Airflow,
> > > we should also work on defining what kind of traces would be useful,
> > > and come up with better naming conventions etc., to make things clear
> > > and easy to understand.
> >
> > Absolutely! I think we have a very clear "separation" and actually
> > "complementary" work that we should indeed do together!
> >
> > 1) For the "internship project" that we do together with Melody, the
> > focus is more on the engineering side - "how we can easily integrate
> > OpenTelemetry" with Airflow - seamlessly, in a modular fashion, and in
> > a way that will be easy to use and test in a "development environment".
> > It is more about solving all the engineering obstacles to integration
> > (for example, what we have learned so far is that OpenTelemetry
> > requires some custom code to account for a "forking" model). It is
> > also about exposing a lot of low-level metrics that are not
> > Airflow-specific (Flask, DB access, etc. - something that really lets
> > you debug "any" application deployment, not only Airflow). Then we
> > thought about simply adding the "current" metrics that we have in
> > statsd as custom ones.
> >
> > 2) And I understand that your focus is more on "how we can actually
> > make a really useful set of Airflow metrics", which ideally complements
> > the "engineering" part - once we get OT fully integrated, we can add
> > not only (or maybe even not at all) the current metrics but, once you
> > help define "better" metrics, we can simply implement them in OT -
> > including some example dashboards etc.
> >
> > Happy to collaborate on that!
> >
> > J.
> >
> > > - Howard
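The per-task resource idea discussed above can be sketched with the standard library alone; the psutil package mentioned in the thread provides richer, cross-platform equivalents (per-process CPU percent, RSS, disk I/O counters). The function name and metric keys below are illustrative, not anything Airflow actually ships:

```python
import os
import resource  # Unix-only stdlib module


def task_process_metrics() -> dict:
    """Return basic resource usage for the current process, a stand-in
    for inspecting a local task job subprocess."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "pid": os.getpid(),
        "cpu_user_s": usage.ru_utime,    # user-mode CPU seconds
        "cpu_system_s": usage.ru_stime,  # kernel-mode CPU seconds
        "peak_rss_kb": usage.ru_maxrss,  # peak resident set size (KiB on Linux)
    }


print(task_process_metrics())
```

In a real integration, values like these would be reported through OpenTelemetry instruments (or gathered host-wide by the OTEL collector agent) rather than printed.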
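To ground the "add the current statsd metrics as custom ones" idea, here is a minimal sketch of the StatsD counter wire format that such a bridge would translate into OpenTelemetry instruments. The class and metric names are illustrative; this is not Airflow's actual statsd client:

```python
import socket


class MiniStatsd:
    """Minimal StatsD-line emitter over UDP (illustrative sketch)."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def format_counter(name: str, value: int = 1) -> str:
        # StatsD counter wire format: "<name>:<value>|c"
        return f"{name}:{value}|c"

    def incr(self, name: str, value: int = 1) -> None:
        # UDP is fire-and-forget: no error if nothing is listening.
        self.sock.sendto(self.format_counter(name, value).encode(), self.addr)


client = MiniStatsd()
client.incr("ti_failures")  # hypothetical counter name
print(MiniStatsd.format_counter("ti_failures"))  # -> "ti_failures:1|c"
```

A statsd-to-OTel bridge would parse lines like these and call the equivalent OpenTelemetry counter's `add()` instead of sending UDP, which is one way the "current" metrics could be carried over unchanged while better-named metrics are designed.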
