++ Michal Modras

On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <eu...@kosteev.com> wrote:

> Cloud Composer recently launched the "Data lineage with Dataplex" feature,
> which generates lineage from DAG/task executions and exports it to
> Data Lineage (a Data Catalog service) for further analysis.
> https://cloud.google.com/composer/docs/composer-2/lineage-integration
>
> This feature is currently in the "Preview" state.
> The current implementation uses the built-in "Airflow lineage
> backend" feature and methods that extract lineage metadata on task
> post-execution events.
>
> The general idea was to contribute this to the Airflow community in the
> following form:
> - generalize lineage metadata extraction as a method of each operator
> itself, using generic lineage entities
> - implement "adapters" to convert the generated metadata to Data Lineage
> format, OpenLineage format, etc.
>
> Adopting "Airflow OpenLineage" for Composer would mean introducing an
> additional layer that converts from OpenLineage format to Data Lineage
> (Data Catalog/Dataplex) format. But this is definitely a possibility.
>
> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> <jul...@astronomer.io.invalid> wrote:
>
>> Thank you very much for your input Jarek.
>> I am responding in the comments and adding to the doc accordingly.
>> I would also love to hear from more stakeholders.
>> Thanks to all who provided feedback so far.
>> Julien
>>
>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> General comment from my side: I think Open Lineage is (and should be
>>> even more) a feature of Airflow that expands Airflow's capabilities
>>> greatly and opens up the direction we've been all working on - Airflow
>>> as a Platform.
>>>
>>> I think closely integrating it with Open-Lineage goes the same
>>> direction (also mentioned in the doc) as Open Telemetry goes, where we
>>> might decide to support certain standards in order to expand
>>> capabilities of Airflow-as-a-platform and allow plugging in multiple
>>> external solutions that would use the standard API. After Open-Lineage
>>> graduated recently to the LFAI&Data foundation (I've been watching this
>>> happening from afar), it is I think the perfect candidate for Airflow
>>> to incorporate it. I hope this will help all the players to make use
>>> of the extra work necessary by the community to make it "officially
>>> supported". I think we have to also get some feedback from the big
>>> stakeholders in Airflow - because one thing is to have such a
>>> capability, and another is to get it used in all the ways Airflow is
>>> used - not only by on-premise/self-hosted users (which is obviously a
>>> huge driving factor) but also everywhere where Airflow is exposed by
>>> others - Astronomer is obviously on board, we see some warm words from
>>> Amazon (mentioned by Julien), and I would love to hear whether the
>>> Composer team at Google would be on board in using the open-lineage
>>> information exposed this way in their Data Catalog (and likely more)
>>> offering. We have Amundsen and others and possibly other stakeholders
>>> might want to say something.
>>>
>>>
>>> There is - undoubtedly - an extra effort involved in implementing and
>>> keeping it running smoothly (as Julien mentioned, that is the main
>>> reason why the Open Lineage community would like to make the
>>> integration part of Airflow). But by being smart and integrating it in
>>> a way that lets us plug it into our CI and verification
>>> process, and by setting very clear expectations about what it means
>>> for contributors to Airflow to get it running, we can make some
>>> initial investment in making it happen and minimise on-going cost,
>>> while maximising the gain.
>>>
>>> And looking at all the above - I am super happy to help with all that
>>> to make this easy to "swallow" and integrate well, even if it will
>>> take an extra effort, especially since we will have experts from Open
>>> Lineage who worked with both Airflow and Open Lineage being the core
>>> part of the effort. I am actually super excited - this might be the
>>> next-big-thing for Airflow to strengthen its position as an
>>> indispensable component of "even more modern data stack".
>>>
>>> I made my initial comments in the doc, and am looking forward to
>>> making it happen :).
>>>
>>> J.
>>>
>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
>>> <jul...@astronomer.io.invalid> wrote:
>>> >
>>> > Dear Airflow Community,
>>> > I have been working on a proposal to bring an OpenLineage provider to
>>> Airflow.
>>> > I am looking for feedback with the goal of posting an official AIP.
>>> > Please feel free to comment in the doc above.
>>> > Thank you,
>>> > Julien (OpenLineage project lead)
>>> >
>>> > For convenience, here is the rationale from the doc:
>>> >
>>> > Operational lineage collection is a common need to understand
>>> dependencies between data pipelines and track end-to-end provenance of
>>> data. It enables many use cases from ensuring reliable delivery of data
>>> through observability to compliance and cost management.
>>> >
>>> > Publishing operational lineage is a core Airflow capability to enable
>>> troubleshooting and governance.
>>> >
>>> > OpenLineage is a project that is part of the LFAI&Data foundation and
>>> provides a spec standardizing operational lineage collection and sharing
>>> across the data ecosystem. While it provides plugins for popular open source
>>> projects, its intent is very similar to OpenTelemetry (also under the Linux
>>> Foundation umbrella): to remain a spec for lineage exchange that projects -
>>> open source or proprietary - implement.
>>> >
>>> > Built-in OpenLineage support in Airflow will make it easier and more
>>> reliable for Airflow users to publish their operational lineage through the
>>> OpenLineage ecosystem.
>>> >
>>> > The current external plugin maintained in the OpenLineage project
>>> depends on Airflow and operator internals and breaks when changes are
>>> made to those. Having a built-in integration ensures better first-class
>>> support for exposing lineage, which gets tested alongside other changes and
>>> is therefore more stable.
>>>
>>
>
> --
> Eugene
>
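For illustration only, the "operator method plus adapters" idea from the quoted message above could be sketched roughly as follows. Every name here (Dataset, TaskLineage, CopyTableOperator, get_lineage, both adapter classes) is hypothetical and is not part of Airflow, Composer, or OpenLineage. The first adapter's output is only loosely shaped like an OpenLineage run event, and the second adapter's source/target "links" layout is invented for the example rather than taken from the Data Lineage API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Dataset:
    """Generic lineage entity (hypothetical): one dataset in some system."""
    namespace: str  # e.g. "bigquery"
    name: str       # e.g. "project.dataset.table"


@dataclass
class TaskLineage:
    """Generic lineage metadata reported by an operator for one task run."""
    job_name: str
    run_id: str
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)


class CopyTableOperator:
    """Stand-in for an operator; deliberately not a real Airflow class."""

    def __init__(self, task_id: str, source: Dataset, target: Dataset):
        self.task_id = task_id
        self.source = source
        self.target = target

    def get_lineage(self, run_id: str) -> TaskLineage:
        # The operator knows what it reads and writes, so it reports that
        # itself using the generic entities above.
        return TaskLineage(self.task_id, run_id, [self.source], [self.target])


class LineageAdapter(Protocol):
    """Converts generic lineage metadata into one backend's format."""

    def convert(self, lineage: TaskLineage) -> Dict: ...


class OpenLineageLikeAdapter:
    """Emits a dict loosely shaped like an OpenLineage run event (assumption)."""

    def convert(self, lineage: TaskLineage) -> Dict:
        def as_dict(d: Dataset) -> Dict:
            return {"namespace": d.namespace, "name": d.name}

        return {
            "eventType": "COMPLETE",
            "run": {"runId": lineage.run_id},
            "job": {"namespace": "airflow", "name": lineage.job_name},
            "inputs": [as_dict(d) for d in lineage.inputs],
            "outputs": [as_dict(d) for d in lineage.outputs],
        }


class SourceTargetLinksAdapter:
    """Emits source -> target links; an invented layout for illustration."""

    def convert(self, lineage: TaskLineage) -> Dict:
        links = [
            {"source": f"{i.namespace}:{i.name}", "target": f"{o.namespace}:{o.name}"}
            for i in lineage.inputs
            for o in lineage.outputs
        ]
        return {"process": lineage.job_name, "run": lineage.run_id, "links": links}


if __name__ == "__main__":
    op = CopyTableOperator(
        "copy_orders",
        Dataset("bigquery", "proj.raw.orders"),
        Dataset("bigquery", "proj.clean.orders"),
    )
    lineage = op.get_lineage(run_id="manual-run-1")
    for adapter in (OpenLineageLikeAdapter(), SourceTargetLinksAdapter()):
        print(adapter.convert(lineage))
```

With this split, operators only depend on the generic entities, and supporting another lineage backend means adding another adapter rather than touching each operator.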


-- 
Eugene
