General comment from my side: I think OpenLineage is (and should be
even more) a feature of Airflow that greatly expands Airflow's
capabilities and opens up the direction we've all been working
towards - Airflow as a Platform.

I think closely integrating with OpenLineage goes in the same
direction (also mentioned in the doc) as OpenTelemetry, where we
might decide to support certain standards in order to expand the
capabilities of Airflow-as-a-platform and allow plugging in multiple
external solutions that use the standard API. After OpenLineage
recently graduated to the LFAI&Data foundation (I've been watching
this happen from afar), it is, I think, the perfect candidate for
Airflow to incorporate. I hope this will help all the players to make
use of the extra work the community needs to do to make it "officially
supported". I think we also have to get some feedback from the big
stakeholders in Airflow - because it is one thing to have such a
capability, and another to get it used in all the ways Airflow is
used - not only by on-premise/self-hosted users (which is obviously a
huge driving factor) but also everywhere Airflow is offered by
others. Astronomer is obviously on board. We see some warm words from
Amazon (mentioned by Julien), and I would love to hear whether the
Composer team at Google would be on board with using the OpenLineage
information exposed this way in their Data Catalog (and likely more)
offering. We have Amundsen and others, and possibly other stakeholders
might want to say something.


There is - undoubtedly - an extra effort involved in implementing and
keeping it running smoothly (as Julien mentioned, that is the main
reason why the OpenLineage community would like to make the
integration part of Airflow). But by being smart and integrating it in
a way that lets us plug it into our CI and verification process, and
by setting very clear expectations about what it means for
contributors to Airflow to get it running, we can make some initial
investment in making it happen and minimise the ongoing cost, while
maximising the gain.

And looking at all the above - I am super happy to help with all that
to make this easy to "swallow" and integrate well, even if it takes
extra effort, especially since we will have experts from OpenLineage
who have worked with both Airflow and OpenLineage as a core part of
the effort. I am actually super excited - this might be the next big
thing for Airflow to strengthen its position as an indispensable
component of the "even more modern data stack".

I made my initial comments in the doc, and am looking forward to
making it happen :).

J.

On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
<jul...@astronomer.io.invalid> wrote:
>
> Dear Airflow Community,
> I have been working on a proposal to bring an OpenLineage provider to Airflow.
> I am looking for feedback with the goal to post an official AIP.
> Please feel free to comment in the doc above.
> Thank you,
> Julien (OpenLineage project lead)
>
> For convenience, here is the rationale from the doc:
>
> Operational lineage collection is a common need to understand dependencies 
> between data pipelines and track end-to-end provenance of data. It enables 
> many use cases from ensuring reliable delivery of data through observability 
> to compliance and cost management.
>
> Publishing operational lineage is a core Airflow capability to enable 
> troubleshooting and governance.
>
> OpenLineage is a project under the LFAI&Data foundation that provides a
> spec standardizing operational lineage collection and sharing across the data
> ecosystem. Although it provides plugins for popular open source projects, its
> intent is very similar to OpenTelemetry (also under the Linux Foundation
> umbrella): to remain a spec for lineage exchange that projects - open source
> or proprietary - implement.
>
> Built-in OpenLineage support in Airflow will make it easier and more reliable 
> for Airflow users to publish their operational lineage through the 
> OpenLineage ecosystem.
>
> The current external plugin maintained in the OpenLineage project depends on
> Airflow and operator internals and breaks when changes are made to
> them. A built-in integration ensures better, first-class support for
> exposing lineage that gets tested alongside other changes and is therefore
> more stable.
