Thank you, Eugen,
This sounds very aligned with the goals of OpenLineage and I think this
would work well.
Here are the sections in the doc that I think address your points:
*- generalize lineage metadata extraction as self-method in each operator,
using generic lineage entities*
See: OpenLineage support in providers
<https://docs.google.com/document/d/1aN5i8WV2Za7XiHTtyrewZscQ-4eXs1ZNfPw58JscFEw/edit#heading=h.n53oowz38zuf>
. It describes how each operator exposes its lineage.
*- implement "adapter"s to convert generated metadata to Data Lineage
format, Open Lineage format, etc.*
The goal here is for each consumer to convert from the OpenLineage format to
its own internal representation, as you are suggesting.
In the motivation section
<https://docs.google.com/document/d/1aN5i8WV2Za7XiHTtyrewZscQ-4eXs1ZNfPw58JscFEw/edit#heading=h.8siih5lo2c33>,
towards the end, I link to a few examples of data catalogs doing just that.
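
As a rough illustration only of the two points above: the sketch below shows an
operator that reports its own lineage using generic entities, and two adapters
that convert that lineage into different target formats. Every class and
method name here (Dataset, TaskLineage, get_lineage, the adapters) is
hypothetical; this is not the API described in the doc, nor the OpenLineage
client library, and the emitted dictionaries are simplified stand-ins rather
than the full OpenLineage or Data Lineage schemas.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Dataset:
    """Generic lineage entity (hypothetical): a dataset within a namespace."""
    namespace: str
    name: str


@dataclass
class TaskLineage:
    """Lineage reported by a single task (hypothetical container)."""
    job_name: str
    inputs: list[Dataset] = field(default_factory=list)
    outputs: list[Dataset] = field(default_factory=list)


class CopyTableOperator:
    """Toy operator that exposes its own lineage instead of relying on
    an external extractor that inspects its internals."""

    def __init__(self, task_id: str, source: Dataset, target: Dataset):
        self.task_id = task_id
        self.source = source
        self.target = target

    def get_lineage(self) -> TaskLineage:
        return TaskLineage(self.task_id, [self.source], [self.target])


class LineageAdapter(Protocol):
    """Anything that turns generic lineage into a consumer-specific format."""
    def convert(self, lineage: TaskLineage) -> dict: ...


class OpenLineageLikeAdapter:
    """Emits a dict loosely shaped like an OpenLineage event (simplified)."""

    def convert(self, lineage: TaskLineage) -> dict:
        def as_dict(d: Dataset) -> dict:
            return {"namespace": d.namespace, "name": d.name}

        return {
            "job": {"name": lineage.job_name},
            "inputs": [as_dict(d) for d in lineage.inputs],
            "outputs": [as_dict(d) for d in lineage.outputs],
        }


class CatalogLinkAdapter:
    """Hypothetical catalog format: one source -> target link per pair."""

    def convert(self, lineage: TaskLineage) -> dict:
        links = [
            {"source": f"{i.namespace}/{i.name}",
             "target": f"{o.namespace}/{o.name}"}
            for i in lineage.inputs
            for o in lineage.outputs
        ]
        return {"process": lineage.job_name, "links": links}
```

With this shape, a consumer could either write its adapter against the
generic entities directly, or (as suggested above) consume the OpenLineage
format and convert from there; the operator code stays the same in both cases.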

On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <[email protected]> wrote:

> ++ Michal Modras
>
> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <[email protected]> wrote:
>
>> Cloud Composer recently launched the "Data lineage with Dataplex" feature,
>> which generates lineage from DAG/task executions and exports it to
>> Data Lineage (Data Catalog service) for further analysis.
>> https://cloud.google.com/composer/docs/composer-2/lineage-integration
>>
>> This feature is as of now in the "Preview" state.
>> The current implementation uses the built-in "Airflow lineage
>> backend" feature and methods that extract lineage metadata on task
>> post-execution events.
>>
>> The general idea was to contribute this to the Airflow community in the
>> following form:
>> - generalize lineage metadata extraction as self-method in each operator,
>> using generic lineage entities
>> - implement "adapter"s to convert generated metadata to Data Lineage
>> format, Open Lineage format, etc.
>>
>> Adopting "Airflow OpenLineage" for Composer would mean introducing an
>> additional layer that converts from the OpenLineage format to the Data
>> Lineage (Data Catalog/Dataplex) format. But this is definitely a possibility.
>>
>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
>> <[email protected]> wrote:
>>
>>> Thank you very much for your input Jarek.
>>> I am responding in the comments and adding to the doc accordingly.
>>> I would also love to hear from more stakeholders.
>>> Thanks to all who provided feedback so far.
>>> Julien
>>>
>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> General comment from my side: I think Open Lineage is (and should be
>>>> even more) a feature of Airflow that expands Airflow's capabilities
>>>> greatly and opens up the direction we've been all working on - Airflow
>>>> as a Platform.
>>>>
>>>> I think closely integrating it with Open-Lineage goes the same
>>>> direction (also mentioned in the doc) as Open Telemetry goes, where we
>>>> might decide to support certain standards in order to expand
>>>> capabilities of Airflow-as-a-platform and allow plugging in multiple
>>>> external solutions that would use the standard API. After Open-Lineage
>>>> graduated recently to the LFAI&Data foundation (I've been watching this
>>>> happening from far), it is I think the perfect candidate for Airflow
>>>> to incorporate it. I hope this will help all the players to make use
>>>> of the extra work necessary by the community to make it "officially
>>>> supported". I think we have to also get some feedback from the big
>>>> stakeholders in Airflow - because one thing is to have such a
>>>> capability, and another is to get it used in all the ways Airflow is
>>>> used - not only by on-premise/self-hosted users (which is obviously a
>>>> huge driving factor) but also everywhere where Airflow is exposed by
>>>> others - Astronomer is obviously on board. We see some warm words from
>>>> Amazon (mentioned by Julien), I would love to hear whether the
>>>> Composer team at Google would be on board in using the open-lineage
>>>> information exposed this way in their Data Catalog (and likely more)
>>>> offering. We have Amundsen and others and possibly other stakeholders
>>>> might want to say something.
>>>>
>>>>
>>>> There is - undoubtedly - an extra effort involved in implementing and
>>>> keeping it running smoothly (as Julien mentioned, that is the main
>>>> reason why the Open Lineage community would like to make the
>>>> integration part of Airflow). But by being smart and integrating it in
>>>> a way that allows us to plug it into our CI and verification
>>>> process, and by setting very clear expectations about what it means
>>>> for contributors to Airflow to get it running, we can make some
>>>> initial investment in making it happen and minimise on-going cost,
>>>> while maximising the gain.
>>>>
>>>> And looking at all the above - I am super happy to help with all that
>>>> to make this easy to "swallow" and integrate well, even if it will
>>>> take an extra effort, especially since we will have experts from Open
>>>> Lineage who worked with both Airflow and Open Lineage being the core
>>>> part of the effort. I am actually super excited - this might be the
>>>> next-big-thing for Airflow to strengthen its position as an
>>>> indispensable component of "even more modern data stack".
>>>>
>>>> I made my initial comments in the doc, and am looking forward to
>>>> making it happen :).
>>>>
>>>> J.
>>>>
>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
>>>> <[email protected]> wrote:
>>>> >
>>>> > Dear Airflow Community,
>>>> > I have been working on a proposal to bring an OpenLineage provider to
>>>> Airflow.
>>>> > I am looking for feedback with the goal to post an official AIP.
>>>> > Please feel free to comment in the doc above.
>>>> > Thank you,
>>>> > Julien (OpenLineage project lead)
>>>> >
>>>> > For convenience, here is the rationale from the doc:
>>>> >
>>>> > Operational lineage collection is a common need to understand
>>>> dependencies between data pipelines and track end-to-end provenance of
>>>> data. It enables many use cases from ensuring reliable delivery of data
>>>> through observability to compliance and cost management.
>>>> >
>>>> > Publishing operational lineage is a core Airflow capability to enable
>>>> troubleshooting and governance.
>>>> >
>>>> > OpenLineage is a project, part of the LFAI&Data foundation, that
>>>> provides a spec standardizing operational lineage collection and sharing
>>>> across the data ecosystem. Although it provides plugins for popular open
>>>> source projects, its intent is very similar to OpenTelemetry (also under the Linux
>>>> Foundation umbrella): to remain a spec for lineage exchange that projects -
>>>> open source or proprietary - implement.
>>>> >
>>>> > Built-in OpenLineage support in Airflow will make it easier and more
>>>> reliable for Airflow users to publish their operational lineage through the
>>>> OpenLineage ecosystem.
>>>> >
>>>> > The current external plugin maintained in the OpenLineage project
>>>> depends on Airflow and operator internals and breaks when those change.
>>>> A built-in integration ensures better first-class support for exposing
>>>> lineage: it gets tested alongside other changes and is therefore more
>>>> stable.
>>>>
>>>
>>
>> --
>> Eugene
>>
>
>
> --
> Eugene
>
