++ Michal Modras

On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <eu...@kosteev.com> wrote:

> Cloud Composer recently launched the "Data lineage with Dataplex" feature,
> which generates lineage from DAG/task executions and exports it to
> Data Lineage (Data Catalog service) for further analysis.
> https://cloud.google.com/composer/docs/composer-2/lineage-integration
>
> This feature is currently in the "Preview" state.
> The current implementation uses the built-in "Airflow lineage backend"
> feature and methods to extract lineage metadata on task post-execution
> events.
>
> The general idea was to contribute this to the Airflow community in the
> following form:
> - generalize lineage metadata extraction as a method of each operator,
>   using generic lineage entities
> - implement "adapters" to convert the generated metadata to Data Lineage
>   format, OpenLineage format, etc.
>
> Adopting "Airflow OpenLineage" for Composer would mean introducing an
> additional layer that converts from the OpenLineage format to the
> Data Lineage (Data Catalog/Dataplex) format. But this is definitely a
> possibility.
>
> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> <jul...@astronomer.io.invalid> wrote:
>
>> Thank you very much for your input, Jarek.
>> I am responding in the comments and adding to the doc accordingly.
>> I would also love to hear from more stakeholders.
>> Thanks to all who provided feedback so far.
>> Julien
>>
>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> General comment from my side: I think Open Lineage is (and should be
>>> even more) a feature of Airflow that expands Airflow's capabilities
>>> greatly and opens up the direction we've all been working on - Airflow
>>> as a Platform.
>>>
>>> I think closely integrating with Open-Lineage goes in the same
>>> direction (also mentioned in the doc) as Open Telemetry, where we
>>> might decide to support certain standards in order to expand the
>>> capabilities of Airflow-as-a-platform and allow plugging in multiple
>>> external solutions that would use the standard API.
>>> After Open-Lineage graduated recently to the LFAI&Data foundation
>>> (I've been watching this happen from afar), it is, I think, the perfect
>>> candidate for Airflow to incorporate. I hope this will help all the
>>> players to make use of the extra work necessary by the community to
>>> make it "officially supported". I think we also have to get some
>>> feedback from the big stakeholders in Airflow - because one thing is to
>>> have such a capability, and another is to get it used in all the ways
>>> Airflow is used - not only by on-premise/self-hosted users (which is
>>> obviously a huge driving factor) but also everywhere Airflow is exposed
>>> by others. Astronomer is obviously on board, and we see some warm words
>>> from Amazon (mentioned by Julien). I would love to hear whether the
>>> Composer team at Google would be on board with using the open-lineage
>>> information exposed this way in their Data Catalog (and likely more)
>>> offering. We have Amundsen and others, and possibly other stakeholders
>>> might want to say something.
>>>
>>> There is - undoubtedly - an extra effort involved in implementing and
>>> keeping it running smoothly (as Julien mentioned, that is the main
>>> reason why the Open Lineage community would like to make the
>>> integration part of Airflow). But by being smart and integrating it in
>>> a way that allows us to plug it into our CI and verification process,
>>> and by setting very clear expectations about what it means for
>>> contributors to Airflow to get it running, we can make some initial
>>> investment in making it happen and minimise the on-going cost, while
>>> maximising the gain.
>>>
>>> And looking at all the above - I am super happy to help with all that
>>> to make this easy to "swallow" and integrate well, even if it will
>>> take an extra effort, especially since we will have experts from Open
>>> Lineage who worked with both Airflow and Open Lineage being the core
>>> part of the effort.
>>> I am actually super excited - this might be the next-big-thing for
>>> Airflow to strengthen its position as an indispensable component of
>>> "even more modern data stack".
>>>
>>> I made my initial comments in the doc, and am looking forward to
>>> making it happen :).
>>>
>>> J.
>>>
>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
>>> <jul...@astronomer.io.invalid> wrote:
>>> >
>>> > Dear Airflow Community,
>>> > I have been working on a proposal to bring an OpenLineage provider to
>>> > Airflow.
>>> > I am looking for feedback with the goal to post an official AIP.
>>> > Please feel free to comment in the doc above.
>>> > Thank you,
>>> > Julien (OpenLineage project lead)
>>> >
>>> > For convenience, here is the rationale from the doc:
>>> >
>>> > Operational lineage collection is a common need to understand
>>> > dependencies between data pipelines and track end-to-end provenance of
>>> > data. It enables many use cases, from ensuring reliable delivery of
>>> > data through observability to compliance and cost management.
>>> >
>>> > Publishing operational lineage is a core Airflow capability to enable
>>> > troubleshooting and governance.
>>> >
>>> > OpenLineage is a project that is part of the LFAI&Data foundation and
>>> > provides a spec standardizing operational lineage collection and
>>> > sharing across the data ecosystem. Although it provides plugins for
>>> > popular open source projects, its intent is very similar to
>>> > OpenTelemetry (also under the Linux Foundation umbrella): to remain a
>>> > spec for lineage exchange that projects - open source or proprietary -
>>> > implement.
>>> >
>>> > Built-in OpenLineage support in Airflow will make it easier and more
>>> > reliable for Airflow users to publish their operational lineage
>>> > through the OpenLineage ecosystem.
>>> >
>>> > The current external plugin maintained in the OpenLineage project
>>> > depends on Airflow and operator internals and breaks when changes are
>>> > made to those.
>>> > Having a built-in integration ensures better first-class support for
>>> > exposing lineage, which gets tested alongside other changes and is
>>> > therefore more stable.
>>>
>>
>
> --
> Eugene

--
Eugene
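
The two-step idea Eugen describes in the thread (each operator reports generic lineage entities, and "adapters" convert them to a target format) can be sketched roughly as follows. This is only an illustration under stated assumptions: every class, method, and field name here is hypothetical, nothing uses the real Airflow, OpenLineage, or Data Lineage APIs, and the two output shapes are simplified stand-ins for the real formats.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass(frozen=True)
class LineageEntity:
    """Hypothetical generic lineage entity, e.g. a table or a file."""
    namespace: str
    name: str


@dataclass
class TaskLineage:
    """Hypothetical generic metadata an operator reports about one task run."""
    job_name: str
    inputs: List[LineageEntity] = field(default_factory=list)
    outputs: List[LineageEntity] = field(default_factory=list)


class LineageAdapter(Protocol):
    """Anything that converts generic lineage into a backend-specific payload."""
    def convert(self, lineage: TaskLineage) -> dict:
        ...


class OpenLineageStyleAdapter:
    """Emits a job/inputs/outputs dict (simplified, not the full OpenLineage spec)."""
    def convert(self, lineage: TaskLineage) -> dict:
        return {
            "job": {"name": lineage.job_name},
            "inputs": [{"namespace": e.namespace, "name": e.name} for e in lineage.inputs],
            "outputs": [{"namespace": e.namespace, "name": e.name} for e in lineage.outputs],
        }


class LinkStyleAdapter:
    """Emits source -> target links (an assumed shape, not a real service schema)."""
    def convert(self, lineage: TaskLineage) -> dict:
        return {
            "process": lineage.job_name,
            "links": [
                {"source": f"{src.namespace}/{src.name}",
                 "target": f"{dst.namespace}/{dst.name}"}
                for src in lineage.inputs
                for dst in lineage.outputs
            ],
        }


class CopyTableOperator:
    """Stand-in for an operator; not a real Airflow class."""
    def __init__(self, task_id: str, source_table: str, target_table: str):
        self.task_id = task_id
        self.source_table = source_table
        self.target_table = target_table

    def get_lineage(self) -> TaskLineage:
        # The operator knows its own inputs and outputs, so it reports them itself.
        return TaskLineage(
            job_name=self.task_id,
            inputs=[LineageEntity("warehouse", self.source_table)],
            outputs=[LineageEntity("warehouse", self.target_table)],
        )


def export_lineage(operator: CopyTableOperator, adapters: List[LineageAdapter]) -> List[dict]:
    """Extract lineage once, then convert it with every configured adapter."""
    lineage = operator.get_lineage()
    return [adapter.convert(lineage) for adapter in adapters]
```

Calling export_lineage(op, [OpenLineageStyleAdapter(), LinkStyleAdapter()]) for a single operator returns one payload per adapter. The point of the split is that extraction lives in the operator and format knowledge lives in the adapter, so adding a backend does not touch operators.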