Thank you Jarek, I am happy to organize a Zoom presentation about OpenLineage <https://openlineage.io/> and answer any questions. It is indeed a spec decoupling the data transformation layer from the metadata store people are using, just like OpenTelemetry does for service metrics/traces.
Best,
Julien
On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote: > And to add a little "parallel" - I think Open Lineage integration > replacing our "generic lineage" is very similar step to the new > "Multi-tenant"-ready authentication interface we are discussing in > https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck > > Yes - we have a generic authentication interface, but no - it's useless > for the case where multi-tenancy and good level of resource authorization > is needed. It's just far too simplistic and limited. > > Same with current lineage generic interface - yes, we have it but it's > only useful in a limited set of cases. and if we want to step-it-up we need > to come up with something better (and Open Lineage happens to be one that > has been developed with Airflow in mind and battle tested). > > J. > > On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote: > >> Hey Rafał (Eugene, Michal - and others who are looking), >> >> I think I know where your/Eugen/Michał concerns are coming from. And I >> think it would be great if we can talk it over a bit. I believe this is - >> in parts - quite a misunderstanding of what Open Lineage really is, how >> much of an integration it is and what are the reasons why it has been >> implemented the way it was implemented in Airflow. >> >> **Idea**: (Julien - Maybe you can organize it ?): >> >> Maybe we can have an open-to-everyone presentation/zoom call with quite >> some time foreseen to ask questions where you would explain the community >> about those integration points (and especially those people who are worried >> we are losing something by choosing the OpenLineage integration). I would >> love to see such a presentation - specifically focused on explaining how >> Open-Lineage is really improving the current lineage approach and what >> problems it solves that the existing generic interface doesn't. 
>> >> Just to set the tone and focus for such meeting if we have one: >> >> For me - when I look at Open Lineage, it is really "this is how lineage >> generic interface **should** be done in Airflow". The "generic" lineage >> support we have now is very, very basic, I'd even say far too simplistic. I >> would even say, it's useless besides a few, very basic use cases. Simply >> because there was never a good "receiver" of the information to cover those >> cases. >> >> When you look closely at OpenLineage, it's nothing more than a better >> convention of the dictionaries that we send as a metadata, better meta-data >> in case of SQL operators (Hooks in the future hopefully), allowing handling >> some cases that current lineage simply cannot. Also what open-lineage >> integration with Airflow covers better handling of the lifecycle "task" and >> "dag" in Airflow to be able to bind lineage data together. That's my >> understanding of what we get when we integrate OL in. >> >> I think over the last 2 years Datakin/Astronomer people had worked out >> the level of interface that **just works** and if we would like to get the >> lineage information from Airflow as useful as it is in OL, we would have to >> anyway implement pretty much all of the things they already did. >> >> I would love (and I think many community members) to take part in such a >> call to hear on that particular aspect of the OL integration. >> >> J. >> >> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz >> <rafalbieg...@google.com.invalid> wrote: >> >>> Hi, >>> >>> I second/echo the input provided by Eugene and Michal. >>> >>> In general, Airflow should provide generic interfaces to lineage >>> backends so it's easy to configure the one preferred by the user. >>> Whether it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it >>> should be the user's choice. >>> >>> We should avoid close integration with any specific lineage backend due >>> to the reasons already mentioned, i.e. 
to avoid translations between lineage backends. Also, we would closely couple one framework (Airflow) with another one (Open Lineage), which makes Airflow more complex and less flexible. Loose coupling between lineage backends and Airflow seems to be more future-proof.
>>>
>>> Regards, Rafal.
>>>
>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
>>>
>>>> Dear Airflow community,
>>>> I have transferred the content of the working google doc I shared a few weeks ago to the Airflow confluence:
>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
>>>> All comments have been answered, I added clarifications to the doc accordingly and I also added your suggestions to improve the proposal.
>>>> All that history is linked from the discussion thread link in the confluence doc if you wish to consult it.
>>>> Thank you all for your feedback and help in the process.
>>>> Best
>>>> Julien
>>>>
>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <jul...@astronomer.io> wrote:
>>>>
>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
>>>>> I do agree with Jarek's assessment. I don't have very much to add to his argument, it is very thoughtful!
>>>>> OpenLineage was started to avoid the cartesian complexity that Eugene mentions. There's actually that specific illustration in the OpenLineage doc <https://openlineage.io/docs/#how-openlineage-benefits-the-ecosystem>.
>>>>> Lineage consumers want to avoid having to understand the lineage format of each individual observed data transformation layer. And transformation layers don't want to understand every Metadata store's model and protocol.
>>>>> Eugene, about your specific proposal about a global vocabulary of entities, I think it is a great suggestion.
>>>>> We can map those entities to Datasets in OpenLineage. 
The way >>>>> OpenLineage models this is by allowing specific facets attached to >>>>> Dataset. Facets >>>>> are pieces of metadata <https://openlineage.io/docs/#core-model>each >>>>> with their own JsonSchema. >>>>> For example a table from a relational database will have a schema >>>>> facet when a file in GCS might not. >>>>> So I think in Airflow we could have each of the entity classes you >>>>> describe be used in the get_openlineage_facets*() API in the Operators. >>>>> Each of those classes would know what OpenLineage facets they can >>>>> expose. >>>>> I'll add a mention in the AIP and I think we can go in more details in >>>>> a ticket. >>>>> Cheers, >>>>> Julien >>>>> >>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com> >>>>> wrote: >>>>> >>>>>> Just a quick personal view on it, Eugene (I bet Julian's answer will >>>>>> be more thoughtful). >>>>>> >>>>>> I think you are right to the "agnostic" part. But I have one question >>>>>> - what are we considering "agnostic"? >>>>>> >>>>>> There is no "widespread" standard for lineage (yet). Open Lineage >>>>>> with its donation to Linux Foundation Data & AI is aspiring to become >>>>>> one. And it's a pretty good candidate: >>>>>> >>>>>> * designed from grounds-up to be agnostic (Open Lineage was only >>>>>> published as an API from day one) >>>>>> * as of recently, the ownership and governance of Open Lineage is with >>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/) which is >>>>>> part of "Linux Foundation Project" - well known and respectful >>>>>> foundation that - similarly to the ASF is an umbrella and provides >>>>>> governance rules for a big number of well established OSS projects >>>>>> >>>>>> In essence it is the same approach as we already discussed and >>>>>> approved for Open Telemetry (which is governed by CNCF which is in the >>>>>> same league as recognition and governance to LFP) (not yet implemented >>>>>> though). 
In the case of Open-Telemetry, we decided against developing >>>>>> our "own" existing standard but we opted for one that is out there. >>>>>> Yes it is a bit more established and popular than Open Lineage is, but >>>>>> i so wish that we chose and implemented it already (and earlier as not >>>>>> having a standard there - except statsd which is really, really poor) >>>>>> has a great impact on Airflow being just "pluggable" in existing >>>>>> solutions for monitoring. (BTW. I hope we implement it soon and I hear >>>>>> (and see) there are attempts to do so). >>>>>> >>>>>> In the case of Open Lineage, the questions are - is there an >>>>>> alternative of the same caliber? Shall we produce our own "agnostic >>>>>> standard" for it instead ? Is there a chance the idea of >>>>>> "airflow-specific" attributes will catch up and many "consumers" will >>>>>> be writing their own conversions to the way they can consume it? >>>>>> >>>>>> I would really, really try to avoid the pitfalls nicely summarized >>>>>> here: https://xkcd.com/927/ >>>>>> >>>>>> We can of course make a wrong bet and in 2 years Airflow might be the >>>>>> only one supporting Open Lineage. That might happen. Though the list >>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or maybe - >>>>>> more likely - once Airflow implements it, due to Airflow's popularity >>>>>> and the fact that there is already competition supporting it (e.g. >>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption of >>>>>> Open Lineage. My bet is - the latter and for the benefit of the whole >>>>>> ecosystem. I think we have a chance to influence creation of a new, >>>>>> important standard. 
Much less so, I think if we just provide our own >>>>>> custom solution - with lots and lots of work for others to be able to >>>>>> consume it, no time to properly nurture the API and make it easier to >>>>>> implement it (which is undoubtedly what Datakin, Astronomer and now >>>>>> LFData & AI run governance main focus is) >>>>>> >>>>>> Are there other alternatives we should consider ? Do we want to >>>>>> develop our own standard (and implement all the integrations from the >>>>>> grounds up) ? >>>>>> >>>>>> J. >>>>>> >>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eu...@kosteev.com> >>>>>> wrote: >>>>>> > >>>>>> > Hi Julien. >>>>>> > >>>>>> > I reviewed the design doc. >>>>>> > The general idea looks good to me, but I have some concerns that I >>>>>> would like to share. >>>>>> > >>>>>> > If I understand correctly the proposed design is to fill in >>>>>> "operators" with self-methods to extract lineage metadata from it, and I >>>>>> agree with the motivation. If those are decoupled (in a form of >>>>>> extractors >>>>>> in separate package) from operators itself, then the downsides is that >>>>>> (as >>>>>> you mentioned) - extractors will be distributed separately and >>>>>> "operators" >>>>>> logic is out of sync with "lineage extraction" logic by design. >>>>>> > Also knowledge about internals of operator spills out of the >>>>>> operator which is not good at all (at the very least). >>>>>> > >>>>>> > However, if we make every operator being exposing method to >>>>>> generate lineage metadata of the specific format, e.g. OpenLineage etc., >>>>>> then we will end up with cartesian complexity of supporting in each >>>>>> provider+operator each backend format. >>>>>> > >>>>>> > If you say that the goal is that "operators" will always generate >>>>>> OpenLineage format only and each consumer will convert this format to >>>>>> their >>>>>> own internal representation, well, if they do this then this seems like a >>>>>> working approach. 
But with the assumption that each consumer will support it.
>>>>>> >
>>>>>> > I think it comes down to the question: is the OpenLineage format popular, complete and proper enough for lineage metadata that every consumer will be convinced to support it? We may also consider issues like a mismatch in lineage feature parity, e.g. OpenLineage supports field-level lineage but a consumer doesn't (or not at the moment), so we would prefer the lineage metadata transferred to the backend to be slightly different in this case.
>>>>>> >
>>>>>> > What do you think about the idea:
>>>>>> > 1. make lineage metadata generated by "operators" agnostic of the specific format, just using entities from a big generic vocabulary of entities, e.g. created here https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py. We would have there e.g. entities like:
>>>>>> > --------------------------------------------------------------------
>>>>>> > import attr
>>>>>> >
>>>>>> >
>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>> > class PostgresTable:
>>>>>> >     """Airflow lineage entity representing Postgres table."""
>>>>>> >
>>>>>> >     host: str = attr.ib()
>>>>>> >     port: str = attr.ib()
>>>>>> >     database: str = attr.ib()
>>>>>> >     schema: str = attr.ib()
>>>>>> >     table: str = attr.ib()
>>>>>> >
>>>>>> >
>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>> > class GCSEntity:
>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
>>>>>> >
>>>>>> >     bucket: str = attr.ib()
>>>>>> >     path: str = attr.ib()
>>>>>> >
>>>>>> >
>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>> > class AWSS3Entity:
>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
>>>>>> >
>>>>>> >     bucket: str = attr.ib()
>>>>>> >     path: str = attr.ib()
>>>>>> > --------------------------------------------------------------------
>>>>>> > 2. 
Implement "adapters" that will act as a bridge between "operators" and backends. Their responsibility will be to convert lineage metadata generated by "operators" to a format understandable by a specific backend.
>>>>>> > And then we can use the built-in mechanism of inlets/outlets to pass Airflow lineage metadata through to the Airflow lineage backend.
>>>>>> >
>>>>>> > I didn't get exactly the implementation details of your proposed design, but I think maintaining a global vocabulary of entities to use in inlets/outlets of operators is crucial for Airflow, as this could be leveraged to build various features on top of it, like displaying a lineage graph in the Airflow UI (based on XCom) :)
>>>>>> >
>>>>>> > Importantly, if we decide to send out lineage metadata from Airflow only in OpenLineage format, well, we could then have only one "adapter", OpenLineageAdapter. But the "adapters" approach leaves us room for adding support for others (following the "pluggable" approach that Airflow is mainly known for and good at).
>>>>>> >
>>>>>> > All in all:
>>>>>> > - global vocabulary of entities used across all "operators" (with all the advantages mentioned above)
>>>>>> > - "adapters" approach
>>>>>> > seem to me the crucial points in the design.
>>>>>> >
>>>>>> > What do you think about this?
>>>>>> >
>>>>>> > - Eugene
>>>>>> >
>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
>>>>>> >>
>>>>>> >> Hello Michał,
>>>>>> >> Thank you for your input.
>>>>>> >> I would clarify that OpenLineage doesn't make any assumption about the backend being used to store lineage and is an adapter-like layer.
>>>>>> >> OpenLineage exists as the spec specifically for that purpose of avoiding the problem of every lineage consumer having to understand every lineage producer. 
>>>>>> >> Consumers of lineage want a unified spec consuming lineage from >>>>>> any data transformation layer like Airflow, Spark, Flink, SQL, >>>>>> Warehouses, >>>>>> ... >>>>>> >> Just like OpenTelemetry allows consuming traces independently of >>>>>> the technology used, so does OpenLineage for lineage. >>>>>> >> Julien >>>>>> >> >>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras < >>>>>> michalmod...@google.com> wrote: >>>>>> >>> >>>>>> >>> Hi everyone, >>>>>> >>> >>>>>> >>> As Airflow already supports lineage functionality through >>>>>> pluggable lineage backends, I think OpenLineage and other lineage systems >>>>>> integration should follow this path. I think more 'native' integration >>>>>> with >>>>>> OpenLineage (or any other lineage system) in Airflow while maintaining >>>>>> the >>>>>> generic lineage backend architecture in parallel would make the user >>>>>> experience less open, troublesome to maintain, and the Airflow >>>>>> architecture >>>>>> itself more constrained by a logic of a specific system. >>>>>> >>> >>>>>> >>> I think enriching operators with a generic method exposing >>>>>> lineage metadata that could be leveraged by lineage backends regardless >>>>>> of >>>>>> their implementation is a good idea which the Cloud Composer team would >>>>>> gladly contribute to. I believe the translation of the Airflow metadata >>>>>> exposed by the operators should be done by lineage backends (or another >>>>>> adapter-like layer). Tying Airflow operators' development to a specific >>>>>> lineage system like OpenLineage forces operators' contributors to >>>>>> understand that system too, which increases both the entry costs and >>>>>> maintenance costs. I see it as unnecessary coupling. 
>>>>>> >>> >>>>>> >>> Best, >>>>>> >>> Michal >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem < >>>>>> jul...@astronomer.io> wrote: >>>>>> >>>> >>>>>> >>>> Thank you Eugen, >>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I >>>>>> think this would work well. >>>>>> >>>> Here are the sections in the doc that I think address your >>>>>> points: >>>>>> >>>> - generalize lineage metadata extraction as self-method in each >>>>>> operator, using generic lineage entities >>>>>> >>>> See: OpenLineage support in providers. It describes how each >>>>>> operator exposes its lineage. >>>>>> >>>> - implement "adapter"s to convert generated metadata to Data >>>>>> Lineage format, Open Lineage format, etc. >>>>>> >>>> The goal here is each consumer turns from OpenLineage format to >>>>>> their own internal representation as you are suggesting. >>>>>> >>>> In the motivation section, towards the end, I link to a few >>>>>> examples of data catalogs doing just that. >>>>>> >>>> >>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <eu...@kosteev.com> >>>>>> wrote: >>>>>> >>>>> >>>>>> >>>>> ++ Michal Modras >>>>>> >>>>> >>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev < >>>>>> eu...@kosteev.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> Cloud Composer recently launched "Data lineage with Dataplex" >>>>>> feature which effectively means to generate lineage out of DAG/task >>>>>> executions and export it to Data Lineage (Data Catalog service) for >>>>>> further >>>>>> analysis. >>>>>> >>>>>> >>>>>> https://cloud.google.com/composer/docs/composer-2/lineage-integration >>>>>> >>>>>> >>>>>> >>>>>> This feature is as of now in the "Preview" state. >>>>>> >>>>>> The current implementation uses built-in "Airflow lineage >>>>>> backend" feature and methods to extract lineage metadata on task post >>>>>> execution events. 
>>>>>> >>>>>> >>>>>> >>>>>> The general idea was to contribute this to the Airflow >>>>>> community in a form: >>>>>> >>>>>> - generalize lineage metadata extraction as self-method in >>>>>> each operator, using generic lineage entities >>>>>> >>>>>> - implement "adapter"s to convert generated metadata to Data >>>>>> Lineage format, Open Lineage format, etc. >>>>>> >>>>>> >>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean to >>>>>> introduce an additional layer of converting from OpenLineage format to >>>>>> Data >>>>>> Lineage (Data Catalog/Dataplex) format. But this is definitely a >>>>>> possibility. >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem >>>>>> <jul...@astronomer.io.invalid> wrote: >>>>>> >>>>>>> >>>>>> >>>>>>> Thank you very much for your input Jarek. >>>>>> >>>>>>> I am responding in the comments and adding to the doc >>>>>> accordingly. >>>>>> >>>>>>> I would also love to hear from more stakeholders. >>>>>> >>>>>>> Thanks to all who provided feedback so far. >>>>>> >>>>>>> Julien >>>>>> >>>>>>> >>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk < >>>>>> ja...@potiuk.com> wrote: >>>>>> >>>>>>>> >>>>>> >>>>>>>> General comment from my side: I think Open Lineage is (and >>>>>> should be >>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's >>>>>> capabilities >>>>>> >>>>>>>> greatly and opens up the direction we've been all working on >>>>>> - Airflow >>>>>> >>>>>>>> as a Platform. >>>>>> >>>>>>>> >>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes the >>>>>> same >>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry >>>>>> goes, where we >>>>>> >>>>>>>> might decide to support certain standards in order to expand >>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allows to plug-in >>>>>> multiple >>>>>> >>>>>>>> external solutions that would use the standard API. 
After >>>>>> Open-Lineage >>>>>> >>>>>>>> graduated recently to LFAI&Data foundation (I've been >>>>>> watching this >>>>>> >>>>>>>> happening from far), it is I think the perfect candidate for >>>>>> Airflow >>>>>> >>>>>>>> to incorporate it. I hope this will help all the players to >>>>>> make use >>>>>> >>>>>>>> of the extra work necessary by the community to make it >>>>>> "officially >>>>>> >>>>>>>> supported". I think we have to also get some feedback from >>>>>> the big >>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have such a >>>>>> >>>>>>>> capability, and another is to get it used in all the ways >>>>>> Airflow is >>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which is >>>>>> obviously a >>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow is >>>>>> exposed by >>>>>> >>>>>>>> others - Astronomer is obviously on-board. we see some warm >>>>>> words from >>>>>> >>>>>>>> Amazon (mentioned by Julian), I would love to hear whether >>>>>> the >>>>>> >>>>>>>> Composer team at Google would be on board in using the >>>>>> open-lineage >>>>>> >>>>>>>> information exposed this way in their Data Catalog (and >>>>>> likely more) >>>>>> >>>>>>>> offering. We have Amundsen and others and possibly other >>>>>> stakeholders >>>>>> >>>>>>>> might want to say something. >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in >>>>>> implementing and >>>>>> >>>>>>>> keeping it running smoothly (as Julian mentioned, that is >>>>>> the main >>>>>> >>>>>>>> reason why the Open Lineage community would like to make the >>>>>> >>>>>>>> integration part of Airflow. 
But by being smart and >>>>>> integrating it in >>>>>> >>>>>>>> the way that will allow to plug-it-in into our CI, >>>>>> verification >>>>>> >>>>>>>> process and making some very clear expectations about what >>>>>> it means >>>>>> >>>>>>>> for contributors to Airflow to get it running, we can make >>>>>> some >>>>>> >>>>>>>> initial investment in making it happen and minimise on-going >>>>>> cost, >>>>>> >>>>>>>> while maximising the gain. >>>>>> >>>>>>>> >>>>>> >>>>>>>> And looking at all the above - I am super happy to help with >>>>>> all that >>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even if >>>>>> it will >>>>>> >>>>>>>> take an extra effort, especially that we will have experts >>>>>> from Open >>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage being >>>>>> the core >>>>>> >>>>>>>> part of the effort. I am actually super excited - this might >>>>>> be the >>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position as an >>>>>> >>>>>>>> indispensable component of "even more modern data stack". >>>>>> >>>>>>>> >>>>>> >>>>>>>> I made my initial comments in the doc, and am looking >>>>>> forward to >>>>>> >>>>>>>> making it happen :). >>>>>> >>>>>>>> >>>>>> >>>>>>>> J. >>>>>> >>>>>>>> >>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem >>>>>> >>>>>>>> <jul...@astronomer.io.invalid> wrote: >>>>>> >>>>>>>> > >>>>>> >>>>>>>> > Dear Airflow Community, >>>>>> >>>>>>>> > I have been working on a proposal to bring an OpenLineage >>>>>> provider to Airflow. >>>>>> >>>>>>>> > I am looking for feedback with the goal to post an >>>>>> official AIP. >>>>>> >>>>>>>> > Please feel free to comment in the doc above. 
>>>>>> >>>>>>>> > Thank you, >>>>>> >>>>>>>> > Julien (OpenLineage project lead) >>>>>> >>>>>>>> > >>>>>> >>>>>>>> > For convenience, here is the rationale from the doc: >>>>>> >>>>>>>> > >>>>>> >>>>>>>> > Operational lineage collection is a common need to >>>>>> understand dependencies between data pipelines and track end-to-end >>>>>> provenance of data. It enables many use cases from ensuring reliable >>>>>> delivery of data through observability to compliance and cost management. >>>>>> >>>>>>>> > >>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow >>>>>> capability to enable troubleshooting and governance. >>>>>> >>>>>>>> > >>>>>> >>>>>>>> > OpenLineage is a project part of the LFAI&Data foundation >>>>>> that provides a spec standardizing operational lineage collection and >>>>>> sharing across the data ecosystem. If it provides plugins for popular >>>>>> open >>>>>> source projects, its intent is very similar to OpenTelemetry (also under >>>>>> the Linux Foundation umbrella): to remain a spec for lineage exchange >>>>>> that >>>>>> projects - open source or proprietary - implement. >>>>>> >>>>>>>> > >>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it >>>>>> easier and more reliable for Airflow users to publish their operational >>>>>> lineage through the OpenLineage ecosystem. >>>>>> >>>>>>>> > >>>>>> >>>>>>>> > The current external plugin maintained in the OpenLineage >>>>>> project depends on Airflow and operators internals and gets broken when >>>>>> changes are made on those. Having a built-in integration ensures a better >>>>>> first class support to expose lineage that gets tested alongside other >>>>>> changes and therefore is more stable. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> >>>>>> Eugene >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> -- >>>>>> >>>>> Eugene >>>>>> > >>>>>> > >>>>>> > >>>>>> > -- >>>>>> > Eugene >>>>>> >>>>>
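
For readers following the thread, the "generic entities plus adapters" idea that Eugene describes, and that Julien suggests mapping onto OpenLineage Datasets, could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses standard-library dataclasses instead of attrs so that it runs without extra packages, and the names LineageAdapter, OpenLineageStyleAdapter and export_lineage are hypothetical. None of this is the Airflow or OpenLineage API, and the namespace/name dictionary shape is an assumed stand-in for whatever a real backend would expect.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol


@dataclass(frozen=True)
class PostgresTable:
    """Generic lineage entity for a Postgres table (fields as in the thread)."""

    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    """Generic lineage entity for a Google Cloud Storage object."""

    bucket: str
    path: str


class LineageAdapter(Protocol):
    """Hypothetical adapter interface: one implementation per lineage backend."""

    def convert(self, entity: Any) -> Dict[str, Any]:
        ...


class OpenLineageStyleAdapter:
    """Hypothetical adapter that maps entities to dataset-like dicts.

    The {"namespace", "name"} shape is an assumption made for illustration;
    it is not produced by, or checked against, the OpenLineage client.
    """

    def convert(self, entity: Any) -> Dict[str, Any]:
        if isinstance(entity, PostgresTable):
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        if isinstance(entity, GCSEntity):
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
        # Unknown entities are rejected instead of being silently dropped.
        raise TypeError(f"no mapping for {type(entity).__name__}")


def export_lineage(
    inlets: List[Any], outlets: List[Any], adapter: LineageAdapter
) -> Dict[str, List[Dict[str, Any]]]:
    """Convert an operator's inlets/outlets with whichever adapter is plugged in."""
    return {
        "inputs": [adapter.convert(e) for e in inlets],
        "outputs": [adapter.convert(e) for e in outlets],
    }


if __name__ == "__main__":
    source = PostgresTable(
        host="db", port="5432", database="shop", schema="public", table="orders"
    )
    target = GCSEntity(bucket="exports", path="orders/2023-02-10.csv")
    print(export_lineage([source], [target], OpenLineageStyleAdapter()))
```

In this sketch the operator only declares entities, and the choice of backend format lives entirely in the adapter. That is the loose coupling Eugene and Michał ask for; if OpenLineage were the only output format, as Julien and Jarek propose, a single adapter of this kind would be enough.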