General comment from my side: I think OpenLineage is (and should be even more) a feature of Airflow that greatly expands Airflow's capabilities and opens up the direction we have all been working on - Airflow as a Platform.
I think integrating closely with OpenLineage goes in the same direction as OpenTelemetry (also mentioned in the doc): we might decide to support certain standards in order to expand the capabilities of Airflow-as-a-platform and allow multiple external solutions that use the standard API to be plugged in. After OpenLineage recently graduated to the LFAI&Data foundation (I've been watching this happen from afar), I think it is the perfect candidate for Airflow to incorporate. I hope this will help all the players benefit from the extra work the community needs to do to make it "officially supported".
I think we also have to get some feedback from the big stakeholders in Airflow, because it is one thing to have such a capability and another to get it used in all the ways Airflow is used - not only by on-premise/self-hosted users (which is obviously a huge driving factor) but also everywhere Airflow is exposed by others. Astronomer is obviously on board, and we see some warm words from Amazon (mentioned by Julien). I would love to hear whether the Composer team at Google would be on board with using the OpenLineage information exposed this way in their Data Catalog (and likely more) offering. We have Amundsen and others, and possibly other stakeholders might want to say something.
There is - undoubtedly - extra effort involved in implementing it and keeping it running smoothly (as Julien mentioned, that is the main reason why the OpenLineage community would like to make the integration part of Airflow). But by being smart and integrating it in a way that lets us plug it into our CI and verification process, and by setting very clear expectations about what it means for contributors to Airflow to get it running, we can make some initial investment in making it happen and minimise the ongoing cost while maximising the gain.
And looking at all the above - I am super happy to help with all that to make this easy to "swallow" and integrate well, even if it takes extra effort, especially since we will have experts from OpenLineage who have worked with both Airflow and OpenLineage as a core part of the effort. I am actually super excited - this might be the next big thing for Airflow to strengthen its position as an indispensable component of the "even more modern data stack". I made my initial comments in the doc, and am looking forward to making it happen :).
J.
On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
>
> Dear Airflow Community,
> I have been working on a proposal to bring an OpenLineage provider to Airflow.
> I am looking for feedback with the goal to post an official AIP.
> Please feel free to comment in the doc above.
> Thank you,
> Julien (OpenLineage project lead)
>
> For convenience, here is the rationale from the doc:
>
> Operational lineage collection is a common need to understand dependencies
> between data pipelines and track end-to-end provenance of data. It enables
> many use cases from ensuring reliable delivery of data through observability
> to compliance and cost management.
>
> Publishing operational lineage is a core Airflow capability to enable
> troubleshooting and governance.
>
> OpenLineage is a project part of the LFAI&Data foundation that provides a
> spec standardizing operational lineage collection and sharing across the data
> ecosystem. If it provides plugins for popular open source projects, its
> intent is very similar to OpenTelemetry (also under the Linux Foundation
> umbrella): to remain a spec for lineage exchange that projects - open source
> or proprietary - implement.
>
> Built-in OpenLineage support in Airflow will make it easier and more reliable
> for Airflow users to publish their operational lineage through the
> OpenLineage ecosystem.
>
> The current external plugin maintained in the OpenLineage project depends on
> Airflow and operators internals and gets broken when changes are made on
> those. Having a built-in integration ensures a better first class support to
> expose lineage that gets tested alongside other changes and therefore is more
> stable.