Thank you Eugen,
This sounds very aligned with the goals of OpenLineage, and I think it would work well.
Here are the sections in the doc that I think address your points:

*- generalize lineage metadata extraction as self-method in each operator, using generic lineage entities*
See: OpenLineage support in providers
<https://docs.google.com/document/d/1aN5i8WV2Za7XiHTtyrewZscQ-4eXs1ZNfPw58JscFEw/edit#heading=h.n53oowz38zuf>.
It describes how each operator exposes its lineage.

*- implement "adapter"s to convert generated metadata to Data Lineage format, Open Lineage format, etc.*
The goal here is for each consumer to convert from the OpenLineage format to its own internal representation, as you are suggesting.
In the motivation section
<https://docs.google.com/document/d/1aN5i8WV2Za7XiHTtyrewZscQ-4eXs1ZNfPw58JscFEw/edit#heading=h.8siih5lo2c33>,
towards the end, I link to a few examples of data catalogs doing just that.

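As a rough sketch of that consumer-side conversion (not taken from the doc): the names `CatalogLink`, `dataset_id`, and `event_to_links`, as well as the simplified event fields, are hypothetical and only loosely modelled on an OpenLineage run event. A real adapter would follow the actual spec and the consumer's own data model.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class CatalogLink:
    """Hypothetical internal record of a consumer's catalog: one source -> target edge."""
    source: str
    target: str
    process: str


def dataset_id(dataset: dict[str, Any]) -> str:
    # Join namespace and name into one identifier (an illustrative convention only).
    return f"{dataset['namespace']}/{dataset['name']}"


def event_to_links(event: dict[str, Any]) -> list[CatalogLink]:
    """Adapter: turn one lineage event into the consumer's own edge records."""
    process = f"{event['job']['namespace']}.{event['job']['name']}"
    # Every input is linked to every output produced by the same job run.
    return [
        CatalogLink(dataset_id(src), dataset_id(dst), process)
        for src in event.get("inputs", [])
        for dst in event.get("outputs", [])
    ]


# Simplified example event; the field names are assumptions for illustration.
example_event = {
    "eventType": "COMPLETE",
    "job": {"namespace": "airflow", "name": "daily_dag.load_orders"},
    "inputs": [{"namespace": "bigquery", "name": "raw.orders"}],
    "outputs": [{"namespace": "bigquery", "name": "clean.orders"}],
}

links = event_to_links(example_event)
```

The point of the design is that operators only need to emit one format, and each catalog keeps its conversion logic on its own side.
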
On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <[email protected]> wrote:

> ++ Michal Modras
>
> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <[email protected]> wrote:
>
>> Cloud Composer recently launched "Data lineage with Dataplex" feature
>> which effectively means to generate lineage out of DAG/task executions and
>> export it to Data Lineage (Data Catalog service) for further analysis.
>> https://cloud.google.com/composer/docs/composer-2/lineage-integration
>>
>> This feature is as of now in the "Preview" state.
>> The current implementation uses built-in "Airflow lineage
>> backend" feature and methods to extract lineage metadata on task
>> post execution events.
>>
>> The general idea was to contribute this to the Airflow community in a
>> form:
>> - generalize lineage metadata extraction as self-method in each operator,
>> using generic lineage entities
>> - implement "adapter"s to convert generated metadata to Data Lineage
>> format, Open Lineage format, etc.
>>
>> Adoption of "Airflow OpenLineage" for Composer would mean to introduce an
>> additional layer of converting from OpenLineage format to Data Lineage
>> (Data Catalog/Dataplex) format. But this is definitely a possibility.
>>
>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
>> <[email protected]> wrote:
>>
>>> Thank you very much for your input Jarek.
>>> I am responding in the comments and adding to the doc accordingly.
>>> I would also love to hear from more stakeholders.
>>> Thanks to all who provided feedback so far.
>>> Julien
>>>
>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> General comment from my side: I think Open Lineage is (and should be
>>>> even more) a feature of Airflow that expands Airflow's capabilities
>>>> greatly and opens up the direction we've been all working on - Airflow
>>>> as a Platform.
>>>>
>>>> I think closely integrating it with Open-Lineage goes the same
>>>> direction (also mentioned in the doc) as Open Telemetry goes, where we
>>>> might decide to support certain standards in order to expand
>>>> capabilities of Airflow-as-a-platform and allow plugging in multiple
>>>> external solutions that would use the standard API. After Open-Lineage
>>>> graduated recently to LFAI&Data foundation (I've been watching this
>>>> happening from afar), it is I think the perfect candidate for Airflow
>>>> to incorporate it. I hope this will help all the players to make use
>>>> of the extra work necessary by the community to make it "officially
>>>> supported". I think we have to also get some feedback from the big
>>>> stakeholders in Airflow - because one thing is to have such a
>>>> capability, and another is to get it used in all the ways Airflow is
>>>> used - not only by on-premise/self-hosted users (which is obviously a
>>>> huge driving factor) but also everywhere where Airflow is exposed by
>>>> others - Astronomer is obviously on board. We see some warm words from
>>>> Amazon (mentioned by Julien), and I would love to hear whether the
>>>> Composer team at Google would be on board in using the open-lineage
>>>> information exposed this way in their Data Catalog (and likely more)
>>>> offering. We have Amundsen and others and possibly other stakeholders
>>>> might want to say something.
>>>>
>>>>
>>>> There is - undoubtedly - an extra effort involved in implementing and
>>>> keeping it running smoothly (as Julien mentioned, that is the main
>>>> reason why the Open Lineage community would like to make the
>>>> integration part of Airflow).
>>>> But by being smart and integrating it in
>>>> a way that will allow us to plug it into our CI and verification
>>>> process, and by setting very clear expectations about what it means
>>>> for contributors to Airflow to get it running, we can make some
>>>> initial investment in making it happen and minimise on-going cost,
>>>> while maximising the gain.
>>>>
>>>> And looking at all the above - I am super happy to help with all that
>>>> to make this easy to "swallow" and integrate well, even if it will
>>>> take an extra effort, especially since we will have experts from Open
>>>> Lineage who worked with both Airflow and Open Lineage being the core
>>>> part of the effort. I am actually super excited - this might be the
>>>> next-big-thing for Airflow to strengthen its position as an
>>>> indispensable component of "even more modern data stack".
>>>>
>>>> I made my initial comments in the doc, and am looking forward to
>>>> making it happen :).
>>>>
>>>> J.
>>>>
>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
>>>> <[email protected]> wrote:
>>>> >
>>>> > Dear Airflow Community,
>>>> > I have been working on a proposal to bring an OpenLineage provider to Airflow.
>>>> > I am looking for feedback with the goal to post an official AIP.
>>>> > Please feel free to comment in the doc above.
>>>> > Thank you,
>>>> > Julien (OpenLineage project lead)
>>>> >
>>>> > For convenience, here is the rationale from the doc:
>>>> >
>>>> > Operational lineage collection is a common need to understand
>>>> > dependencies between data pipelines and track end-to-end provenance of
>>>> > data. It enables many use cases from ensuring reliable delivery of data
>>>> > through observability to compliance and cost management.
>>>> >
>>>> > Publishing operational lineage is a core Airflow capability to enable
>>>> > troubleshooting and governance.
>>>> >
>>>> > OpenLineage is a project that is part of the LFAI&Data foundation and
>>>> > provides a spec standardizing operational lineage collection and sharing
>>>> > across the data ecosystem. While it provides plugins for popular open source
>>>> > projects, its intent is very similar to OpenTelemetry (also under the Linux
>>>> > Foundation umbrella): to remain a spec for lineage exchange that projects -
>>>> > open source or proprietary - implement.
>>>> >
>>>> > Built-in OpenLineage support in Airflow will make it easier and more
>>>> > reliable for Airflow users to publish their operational lineage through the
>>>> > OpenLineage ecosystem.
>>>> >
>>>> > The current external plugin maintained in the OpenLineage project
>>>> > depends on Airflow and operator internals and breaks when those
>>>> > change. Having a built-in integration ensures better first-class
>>>> > support for exposing lineage that gets tested alongside other changes and
>>>> > is therefore more stable.
>>>>
>>>
>>
>> --
>> Eugene
>>
>
>
> --
> Eugene
>
