Thank you Jarek,
I am happy to organize a Zoom presentation about OpenLineage
<https://openlineage.io/> and answer any questions. It is indeed a spec that
decouples the data transformation layer from the metadata store people are
using, just as OpenTelemetry does for service metrics and traces.
Best,
Julien

On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> And to add a little "parallel" - I think the Open Lineage integration
> replacing our "generic lineage" is a very similar step to the new
> "Multi-tenant"-ready authentication interface we are discussing in
> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
>
> Yes - we have a generic authentication interface, but no - it's useless
> for the case where multi-tenancy and a good level of resource authorization
> are needed. It's just far too simplistic and limited.
>
> Same with the current generic lineage interface - yes, we have it, but it's
> only useful in a limited set of cases. And if we want to step it up we need
> to come up with something better (and Open Lineage happens to be one that
> has been developed with Airflow in mind and battle-tested).
>
> J.
>
> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Hey Rafał (Eugene, Michał - and others who are looking),
>>
>> I think I know where your, Eugene's and Michał's concerns are coming from,
>> and I think it would be great if we could talk it over a bit. I believe this
>> is - in part - quite a misunderstanding of what Open Lineage really is, how
>> much of an integration it is, and the reasons why it has been implemented
>> the way it was in Airflow.
>>
>> **Idea**: (Julien - maybe you can organize it?):
>>
>> Maybe we can have an open-to-everyone presentation/Zoom call, with plenty
>> of time set aside for questions, where you would explain those integration
>> points to the community (and especially to those people who are worried
>> we are losing something by choosing the OpenLineage integration). I would
>> love to see such a presentation - specifically focused on explaining how
>> Open Lineage really improves the current lineage approach and what
>> problems it solves that the existing generic interface doesn't.
>>
>> Just to set the tone and focus for such a meeting, if we have one:
>>
>> For me - when I look at Open Lineage, it is really "this is how the generic
>> lineage interface **should** be done in Airflow". The "generic" lineage
>> support we have now is very, very basic, I'd even say far too simplistic. I
>> would even say it's useless beyond a few very basic use cases, simply
>> because there was never a good "receiver" of the information to cover those
>> cases.
>>
>> When you look closely at OpenLineage, it's nothing more than a better
>> convention for the dictionaries that we send as metadata, plus better
>> metadata in the case of SQL operators (Hooks in the future, hopefully),
>> which allows handling some cases that the current lineage simply cannot.
>> The Open Lineage integration with Airflow also covers better handling of
>> the "task" and "dag" lifecycle in Airflow, so that lineage data can be bound
>> together. That's my understanding of what we get when we integrate OL.
>>
>> I think over the last 2 years the Datakin/Astronomer people have worked out
>> a level of interface that **just works**, and if we wanted lineage
>> information from Airflow to be as useful as it is in OL, we would have to
>> implement pretty much all of the things they already did anyway.
>>
>> I (and I think many community members) would love to take part in such a
>> call to hear about that particular aspect of the OL integration.
>>
>> J.
>>
>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz
>> <rafalbieg...@google.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> I second/echo the input provided by Eugene and Michal.
>>>
>>> In general, Airflow should provide generic interfaces to lineage
>>> backends so it's easy to configure the one preferred by the user.
>>> Whether it's Open Lineage, a proprietary solution, Dataplex Lineage, etc.,
>>> it should be the user's choice.
>>>
>>> We should avoid close integration with any specific lineage backend for
>>> the reasons already mentioned, i.e. to avoid translations between
>>> lineage backends. Also, we would closely couple one framework (Airflow)
>>> with another one (Open Lineage), which makes Airflow more complex and less
>>> flexible. Loose coupling between lineage backends and Airflow seems to be
>>> more future-proof.
>>>
>>> Regards, Rafal.
>>>
>>>
>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
>>> <jul...@astronomer.io.invalid> wrote:
>>>
>>>> Dear Airflow community,
>>>> I have transferred the content of the working google doc I shared a few
>>>> weeks ago to the Airflow confluence:
>>>>
>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
>>>> All comments have been answered, I added clarifications to the doc
>>>> accordingly and I also added your suggestions to improve the proposal.
>>>> All that history is linked from the discussion thread link in the
>>>> confluence doc if you wish to consult it.
>>>> Thank you all for your feedback and help in the process.
>>>> Best
>>>> Julien
>>>>
>>>>
>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <jul...@astronomer.io>
>>>> wrote:
>>>>
>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
>>>>> I do agree with Jarek's assessment. I don't have very much to add to
>>>>> his argument, it is very thoughtful!
>>>>> OpenLineage was started to avoid the cartesian complexity that Eugene
>>>>> mentions. There's actually that specific illustration in the OpenLineage
>>>>> doc
>>>>> <https://openlineage.io/docs/#how-openlineage-benefits-the-ecosystem>.
>>>>> Lineage consumers want to avoid having to understand the lineage
>>>>> format of each individual observed data transformation layer. And
>>>>> transformation layers don't want to understand every metadata store's
>>>>> model and protocol.
>>>>> Eugene, regarding your specific proposal for a global vocabulary of
>>>>> entities, I think it is a great suggestion.
>>>>> We can map those entities to Datasets in OpenLineage. The way
>>>>> OpenLineage models this is by allowing specific facets to be attached to
>>>>> a Dataset. Facets are pieces of metadata
>>>>> <https://openlineage.io/docs/#core-model>, each with their own JsonSchema.
>>>>> For example, a table from a relational database will have a schema
>>>>> facet, whereas a file in GCS might not.
>>>>> So I think in Airflow we could have each of the entity classes you
>>>>> describe be used in the get_openlineage_facets*() API in the Operators.
>>>>> Each of those classes would know what OpenLineage facets they can
>>>>> expose.
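
A minimal Python sketch of the idea above, in which each entity class knows which facets it can expose. Everything in it is illustrative: the class names, the `to_dataset()` method, the `columns` field and the plain-dict output are assumptions made for this example, not the Airflow or OpenLineage client API, and the namespace/name strings only approximate OpenLineage's dataset naming.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PostgresTableEntity:
    """Hypothetical entity: a relational table, which has a known schema."""

    host: str
    port: int
    database: str
    schema: str
    table: str
    # Column name -> column type; hypothetical field used to build a schema facet.
    columns: Dict[str, str] = field(default_factory=dict)

    def to_dataset(self) -> Dict[str, Any]:
        facets: Dict[str, Any] = {}
        if self.columns:
            # A table can describe its columns, so it exposes a "schema" facet.
            facets["schema"] = {
                "fields": [{"name": n, "type": t} for n, t in self.columns.items()]
            }
        return {
            "namespace": f"postgres://{self.host}:{self.port}",
            "name": f"{self.database}.{self.schema}.{self.table}",
            "facets": facets,
        }


@dataclass
class GCSObjectEntity:
    """Hypothetical entity: a file in GCS, which has no schema to expose."""

    bucket: str
    path: str

    def to_dataset(self) -> Dict[str, Any]:
        return {"namespace": f"gs://{self.bucket}", "name": self.path, "facets": {}}


def collect_datasets(entities: List[Any]) -> List[Dict[str, Any]]:
    """Stand-in for what an operator's lineage method might return (assumption)."""
    return [entity.to_dataset() for entity in entities]


table = PostgresTableEntity(
    host="db", port=5432, database="shop", schema="public", table="orders",
    columns={"id": "int", "total": "numeric"},
)
datasets = collect_datasets([table, GCSObjectEntity(bucket="exports", path="orders.csv")])
```

The point of the sketch is only that the facet logic lives with the entity, so an operator that declares its inlets and outlets does not need to know the facet details.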
>>>>> I'll add a mention in the AIP and I think we can go into more detail in
>>>>> a ticket.
>>>>> Cheers,
>>>>> Julien
>>>>>
>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com>
>>>>> wrote:
>>>>>
>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer will
>>>>>> be more thoughtful).
>>>>>>
>>>>>> I think you are right about the "agnostic" part. But I have one question
>>>>>> - what are we considering "agnostic"?
>>>>>>
>>>>>> There is no "widespread" standard for lineage (yet). Open Lineage,
>>>>>> with its donation to Linux Foundation Data & AI, is aspiring to become
>>>>>> one. And it's a pretty good candidate:
>>>>>>
>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
>>>>>> published as an API from day one)
>>>>>> * as of recently, the ownership and governance of Open Lineage is with
>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/), which is
>>>>>> part of the "Linux Foundation Project" - a well-known and respected
>>>>>> foundation that, similarly to the ASF, is an umbrella and provides
>>>>>> governance rules for a large number of well-established OSS projects
>>>>>>
>>>>>> In essence it is the same approach as we already discussed and
>>>>>> approved for Open Telemetry (which is governed by CNCF, which is in the
>>>>>> same league as LFP in terms of recognition and governance), though it
>>>>>> is not yet implemented. In the case of Open Telemetry, we decided
>>>>>> against developing our "own" standard and opted for one that is
>>>>>> already out there. Yes, it is a bit more established and popular than
>>>>>> Open Lineage is, but I so wish that we had chosen and implemented it
>>>>>> already, and earlier, because not having a standard there (except
>>>>>> statsd, which is really, really poor) has a great impact on Airflow
>>>>>> being just "pluggable" into existing monitoring solutions. (BTW, I hope
>>>>>> we implement it soon, and I hear (and see) there are attempts to do so.)
>>>>>>
>>>>>> In the case of Open Lineage, the questions are - is there an
>>>>>> alternative of the same caliber? Shall we produce our own "agnostic
>>>>>> standard" for it instead? Is there a chance the idea of
>>>>>> "airflow-specific" attributes will catch on and many "consumers" will
>>>>>> write their own conversions into a form they can consume?
>>>>>>
>>>>>> I would really, really try to avoid the pitfalls nicely summarized
>>>>>> here: https://xkcd.com/927/
>>>>>>
>>>>>> We can of course make a wrong bet, and in 2 years Airflow might be the
>>>>>> only one supporting Open Lineage. That might happen. Though the list
>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or maybe -
>>>>>> more likely - once Airflow implements it, due to Airflow's popularity
>>>>>> and the fact that there is already competition supporting it (e.g.
>>>>>> Amundsen), we will increase the chance of "hockey-stick" adoption of
>>>>>> Open Lineage. My bet is on the latter, for the benefit of the whole
>>>>>> ecosystem. I think we have a chance to influence the creation of a new,
>>>>>> important standard. Much less so, I think, if we just provide our own
>>>>>> custom solution - with lots and lots of work for others to be able to
>>>>>> consume it, and no time to properly nurture the API and make it easier
>>>>>> to implement (which is undoubtedly the main focus of Datakin,
>>>>>> Astronomer and now the LF Data & AI-run governance).
>>>>>>
>>>>>> Are there other alternatives we should consider? Do we want to
>>>>>> develop our own standard (and implement all the integrations from the
>>>>>> ground up)?
>>>>>>
>>>>>> J.
>>>>>>
>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eu...@kosteev.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > Hi Julien.
>>>>>> >
>>>>>> > I reviewed the design doc.
>>>>>> > The general idea looks good to me, but I have some concerns that I
>>>>>> would like to share.
>>>>>> >
>>>>>> > If I understand correctly, the proposed design is to equip
>>>>>> "operators" with their own methods for extracting lineage metadata, and
>>>>>> I agree with the motivation. If those are decoupled from the operators
>>>>>> themselves (in the form of extractors in a separate package), then the
>>>>>> downside is that (as you mentioned) extractors are distributed
>>>>>> separately and the "operators" logic is out of sync with the "lineage
>>>>>> extraction" logic by design.
>>>>>> > Also, knowledge about the internals of an operator spills out of the
>>>>>> operator, which is not good at all (at the very least).
>>>>>> >
>>>>>> > However, if we make every operator expose a method to generate
>>>>>> lineage metadata in a specific format, e.g. OpenLineage etc., then we
>>>>>> will end up with the cartesian complexity of supporting each backend
>>>>>> format in each provider+operator.
>>>>>> >
>>>>>> > If you say that the goal is that "operators" will always generate
>>>>>> the OpenLineage format only and each consumer will convert this format
>>>>>> to their own internal representation, well, if they do this then it
>>>>>> seems like a working approach. But it rests on the assumption that each
>>>>>> consumer will support it.
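
To make the consumer-side conversion concrete, here is a small Python sketch in which a consumer turns an OpenLineage-style event into its own internal representation, in this case a list of graph edges. The event is deliberately simplified (a real event carries more fields), and `event_to_edges`, `dataset_key` and the edge format are hypothetical names chosen for this example only.

```python
from typing import Any, Dict, List, Tuple

# The consumer's hypothetical internal model: a directed edge (source, target).
Edge = Tuple[str, str]


def dataset_key(dataset: Dict[str, Any]) -> str:
    """Build one identifier from the dataset's namespace and name."""
    return f"{dataset['namespace']}/{dataset['name']}"


def event_to_edges(event: Dict[str, Any]) -> List[Edge]:
    """Convert a simplified OpenLineage-style event into input->job->output edges."""
    job = f"job:{event['job']['namespace']}/{event['job']['name']}"
    edges: List[Edge] = [(dataset_key(d), job) for d in event.get("inputs", [])]
    edges.extend((job, dataset_key(d)) for d in event.get("outputs", []))
    return edges


# Simplified sample event; the values are made up for illustration.
sample_event = {
    "eventType": "COMPLETE",
    "job": {"namespace": "airflow", "name": "daily_dag.load_orders"},
    "inputs": [{"namespace": "postgres://db:5432", "name": "shop.public.orders"}],
    "outputs": [{"namespace": "gs://exports", "name": "orders.csv"}],
}

edges = event_to_edges(sample_event)
```

A consumer that lacks some feature (for example field-level lineage) could simply ignore the parts of the event it cannot store, which is one way to read the feature-parity concern raised in this thread.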
>>>>>> >
>>>>>> > I think it comes down to the question: is the OpenLineage format
>>>>>> popular, complete and proper enough for lineage metadata that every
>>>>>> consumer will be convinced to support it? We may also consider issues
>>>>>> like a mismatch in lineage feature parity, e.g. OpenLineage supports
>>>>>> field-level lineage but a consumer doesn't (or not at the moment), so we
>>>>>> would prefer the lineage metadata transferred to that backend to be
>>>>>> slightly different in this case.
>>>>>> >
>>>>>> > What do you think about this idea:
>>>>>> > 1. Make the lineage metadata generated by "operators" agnostic of
>>>>>> any specific format, just using entities from a big generic vocabulary
>>>>>> of entities, e.g. created here:
>>>>>> https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py.
>>>>>> We would have there e.g. entities like:
>>>>>> > --------------------------------------------------------------------
>>>>>> > import attr
>>>>>> >
>>>>>> >
>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>> > class PostgresTable:
>>>>>> >     """Airflow lineage entity representing Postgres table."""
>>>>>> >
>>>>>> >     host: str = attr.ib()
>>>>>> >     port: str = attr.ib()
>>>>>> >     database: str = attr.ib()
>>>>>> >     schema: str = attr.ib()
>>>>>> >     table: str = attr.ib()
>>>>>> >
>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>> > class GCSEntity:
>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
>>>>>> >
>>>>>> >     bucket: str = attr.ib()
>>>>>> >     path: str = attr.ib()
>>>>>> >
>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>> > class AWSS3Entity:
>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
>>>>>> >
>>>>>> >     bucket: str = attr.ib()
>>>>>> >     path: str = attr.ib()
>>>>>> > --------------------------------------------------------------------
>>>>>> > 2. Implement "adapters" that will act as a bridge between
>>>>>> "operators" and backends. Their responsibility will be to convert lineage
>>>>>> metadata generated by "operators" to a format understandable by a specific
>>>>>> backend.
>>>>>> > And then we can use the built-in mechanism of inlets/outlets to
>>>>>> pass Airflow lineage metadata to the Airflow lineage backend.
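
The "adapters" proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions: the entities are re-declared with stdlib dataclasses so the snippet is self-contained, and `LineageAdapter`, `FlatStringAdapter`, `export_lineage` and the output formats are hypothetical names and shapes rather than existing Airflow or OpenLineage APIs (only the name `OpenLineageAdapter` comes from this thread).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class PostgresTable:
    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass
class GCSEntity:
    bucket: str
    path: str


Entity = Union[PostgresTable, GCSEntity]


class LineageAdapter(ABC):
    """Hypothetical bridge: converts generic entities to one backend's format."""

    @abstractmethod
    def convert(self, entity: Entity) -> Any:
        ...


class OpenLineageAdapter(LineageAdapter):
    """Emits OpenLineage-style dataset dicts (simplified)."""

    def convert(self, entity: Entity) -> Dict[str, str]:
        if isinstance(entity, PostgresTable):
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        if isinstance(entity, GCSEntity):
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
        raise TypeError(f"unsupported entity: {type(entity).__name__}")


class FlatStringAdapter(LineageAdapter):
    """Made-up second backend that only wants one identifier string per entity."""

    def convert(self, entity: Entity) -> str:
        if isinstance(entity, PostgresTable):
            return f"postgres:{entity.host}:{entity.port}/{entity.database}/{entity.schema}/{entity.table}"
        if isinstance(entity, GCSEntity):
            return f"gcs:{entity.bucket}/{entity.path}"
        raise TypeError(f"unsupported entity: {type(entity).__name__}")


def export_lineage(adapter: LineageAdapter, inlets: List[Entity], outlets: List[Entity]) -> Dict[str, List[Any]]:
    """Convert a task's inlets/outlets with whichever adapter is configured."""
    return {
        "inputs": [adapter.convert(e) for e in inlets],
        "outputs": [adapter.convert(e) for e in outlets],
    }


inlets = [PostgresTable(host="db", port="5432", database="shop", schema="public", table="orders")]
outlets = [GCSEntity(bucket="exports", path="orders.csv")]
open_lineage_payload = export_lineage(OpenLineageAdapter(), inlets, outlets)
flat_payload = export_lineage(FlatStringAdapter(), inlets, outlets)
```

The sketch also shows the cost raised elsewhere in this thread: every adapter has to handle every entity type, so the work grows with entities times backends unless one format is treated as the common one.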
>>>>>> >
>>>>>> > I didn't quite get the implementation details of your proposed
>>>>>> design, but I think maintaining a global vocabulary of entities to use in
>>>>>> inlets/outlets of operators is crucial for Airflow, as this could be
>>>>>> leveraged to build various features on top of it, like displaying a lineage
>>>>>> graph in the Airflow UI (based on XCOM) :)
>>>>>> >
>>>>>> > It is important to note that if we decide to send lineage metadata out
>>>>>> of Airflow only in the OpenLineage format, well, we could then have only
>>>>>> one "adapter", OpenLineageAdapter. But the "adapters" approach leaves us
>>>>>> room for adding support for others (following the "pluggable" approach
>>>>>> that Airflow is mainly known for).
>>>>>> >
>>>>>> > All in all:
>>>>>> > - a global vocabulary of entities used across all "operators" (with
>>>>>> all the advantages that come with it, mentioned above)
>>>>>> > - the "adapters" approach
>>>>>> > seem to me to be the crucial points in the design.
>>>>>> >
>>>>>> > What do you think about this?
>>>>>> >
>>>>>> > - Eugene
>>>>>> >
>>>>>> >
>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
>>>>>> <jul...@astronomer.io.invalid> wrote:
>>>>>> >>
>>>>>> >> Hello Michał,
>>>>>> >> Thank you for your input.
>>>>>> >> I would clarify that OpenLineage doesn't make any assumption about
>>>>>> the backend being used to store lineage and is an adapter-like layer.
>>>>>> >> OpenLineage exists as a spec specifically for the purpose of
>>>>>> avoiding the problem of every lineage consumer having to understand every
>>>>>> lineage producer.
>>>>>> >> Consumers of lineage want a unified spec for consuming lineage from
>>>>>> any data transformation layer like Airflow, Spark, Flink, SQL,
>>>>>> warehouses, ...
>>>>>> >> Just like OpenTelemetry allows consuming traces independently of
>>>>>> the technology used, so does OpenLineage for lineage.
>>>>>> >> Julien
>>>>>> >>
>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
>>>>>> michalmod...@google.com> wrote:
>>>>>> >>>
>>>>>> >>> Hi everyone,
>>>>>> >>>
>>>>>> >>> As Airflow already supports lineage functionality through
>>>>>> pluggable lineage backends, I think OpenLineage and other lineage systems
>>>>>> integration should follow this path. I think more 'native' integration 
>>>>>> with
>>>>>> OpenLineage (or any other lineage system) in Airflow while maintaining 
>>>>>> the
>>>>>> generic lineage backend architecture in parallel would make the user
>>>>>> experience less open, troublesome to maintain, and the Airflow 
>>>>>> architecture
>>>>>> itself more constrained by a logic of a specific system.
>>>>>> >>>
>>>>>> >>> I think enriching operators with a generic method exposing
>>>>>> lineage metadata that could be leveraged by lineage backends regardless 
>>>>>> of
>>>>>> their implementation is a good idea which the Cloud Composer team would
>>>>>> gladly contribute to. I believe the translation of the Airflow metadata
>>>>>> exposed by the operators should be done by lineage backends (or another
>>>>>> adapter-like layer). Tying Airflow operators' development to a specific
>>>>>> lineage system like OpenLineage forces operators' contributors to
>>>>>> understand that system too, which increases both the entry costs and
>>>>>> maintenance costs. I see it as unnecessary coupling.
>>>>>> >>>
>>>>>> >>> Best,
>>>>>> >>> Michal
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
>>>>>> jul...@astronomer.io> wrote:
>>>>>> >>>>
>>>>>> >>>> Thank you Eugen,
>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I
>>>>>> think this would work well.
>>>>>> >>>> Here are the sections in the doc that I think address your
>>>>>> points:
>>>>>> >>>> - generalize lineage metadata extraction as self-method in each
>>>>>> operator, using generic lineage entities
>>>>>> >>>> See: OpenLineage support in providers. It describes how each
>>>>>> operator exposes its lineage.
>>>>>> >>>> - implement "adapter"s to convert generated metadata to Data
>>>>>> Lineage format, Open Lineage format, etc.
>>>>>> >>>> The goal here is each consumer turns from OpenLineage format to
>>>>>> their own internal representation as you are suggesting.
>>>>>> >>>> In the motivation section, towards the end, I link to a few
>>>>>> examples of data catalogs doing just that.
>>>>>> >>>>
>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <eu...@kosteev.com>
>>>>>> wrote:
>>>>>> >>>>>
>>>>>> >>>>> ++ Michal Modras
>>>>>> >>>>>
>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
>>>>>> eu...@kosteev.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> Cloud Composer recently launched the "Data lineage with Dataplex"
>>>>>> feature, which effectively generates lineage from DAG/task
>>>>>> executions and exports it to Data Lineage (Data Catalog service) for
>>>>>> further analysis.
>>>>>> >>>>>>
>>>>>> https://cloud.google.com/composer/docs/composer-2/lineage-integration
>>>>>> >>>>>>
>>>>>> >>>>>> This feature is as of now in the "Preview" state.
>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage
>>>>>> backend" feature and methods to extract lineage metadata on task post
>>>>>> execution events.
>>>>>> >>>>>>
>>>>>> >>>>>> The general idea was to contribute this to the Airflow
>>>>>> community in the following form:
>>>>>> >>>>>> - generalize lineage metadata extraction as self-method in
>>>>>> each operator, using generic lineage entities
>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to Data
>>>>>> Lineage format, Open Lineage format, etc.
>>>>>> >>>>>>
>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean
>>>>>> introducing an additional layer that converts from the OpenLineage format
>>>>>> to the Data Lineage (Data Catalog/Dataplex) format. But this is
>>>>>> definitely a possibility.
>>>>>> >>>>>>
>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
>>>>>> <jul...@astronomer.io.invalid> wrote:
>>>>>> >>>>>>>
>>>>>> >>>>>>> Thank you very much for your input Jarek.
>>>>>> >>>>>>> I am responding in the comments and adding to the doc
>>>>>> accordingly.
>>>>>> >>>>>>> I would also love to hear from more stakeholders.
>>>>>> >>>>>>> Thanks to all who provided feedback so far.
>>>>>> >>>>>>> Julien
>>>>>> >>>>>>>
>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
>>>>>> ja...@potiuk.com> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is (and
>>>>>> should be
>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's
>>>>>> capabilities
>>>>>> >>>>>>>> greatly and opens up the direction we've been all working on
>>>>>> - Airflow
>>>>>> >>>>>>>> as a Platform.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes the
>>>>>> same
>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry
>>>>>> goes, where we
>>>>>> >>>>>>>> might decide to support certain standards in order to expand
>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allows to plug-in
>>>>>> multiple
>>>>>> >>>>>>>> external solutions that would use the standard API. After
>>>>>> Open-Lineage
>>>>>> >>>>>>>> graduated recently to  LFAI&Data foundation (I've been
>>>>>> watching this
>>>>>> >>>>>>>> happening from far), it is I think the perfect candidate for
>>>>>> Airflow
>>>>>> >>>>>>>> to incorporate it. I hope this will help all the players to
>>>>>> make use
>>>>>> >>>>>>>> of the extra work necessary by the community to make it
>>>>>> "officially
>>>>>> >>>>>>>> supported". I think we have to also get some feedback from
>>>>>> the big
>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have such a
>>>>>> >>>>>>>> capability, and another is to get it used in all the ways
>>>>>> Airflow is
>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which is
>>>>>> obviously a
>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow is
>>>>>> exposed by
>>>>>> >>>>>>>> others - Astronomer is obviously on board, we see some warm
>>>>>> words from
>>>>>> >>>>>>>> Amazon (mentioned by Julien), and I would love to hear whether
>>>>>> the
>>>>>> >>>>>>>> Composer team at Google would be on board in using the
>>>>>> open-lineage
>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and
>>>>>> likely more)
>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly other
>>>>>> stakeholders
>>>>>> >>>>>>>> might want to say something.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in
>>>>>> implementing and
>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that is
>>>>>> the main
>>>>>> >>>>>>>> reason why the Open Lineage community would like to make the
>>>>>> >>>>>>>> integration part of Airflow). But by being smart and
>>>>>> integrating it in
>>>>>> >>>>>>>> the way that will allow to plug-it-in into our CI,
>>>>>> verification
>>>>>> >>>>>>>> process and making some very clear expectations about what
>>>>>> it means
>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can make
>>>>>> some
>>>>>> >>>>>>>> initial investment in making it happen and minimise on-going
>>>>>> cost,
>>>>>> >>>>>>>> while maximising the gain.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> And looking at all the above - I am super happy to help with
>>>>>> all that
>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even if
>>>>>> it will
>>>>>> >>>>>>>> take an extra effort, especially that we will have experts
>>>>>> from Open
>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage being
>>>>>> the core
>>>>>> >>>>>>>> part of the effort. I am actually super excited - this might
>>>>>> be the
>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position as an
>>>>>> >>>>>>>> indispensable component of "even more modern data stack".
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking
>>>>>> forward to
>>>>>> >>>>>>>> making it happen :).
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> J.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
>>>>>> >>>>>>>> <jul...@astronomer.io.invalid> wrote:
>>>>>> >>>>>>>> >
>>>>>> >>>>>>>> > Dear Airflow Community,
>>>>>> >>>>>>>> > I have been working on a proposal to bring an OpenLineage
>>>>>> provider to Airflow.
>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an
>>>>>> official AIP.
>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
>>>>>> >>>>>>>> > Thank you,
>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
>>>>>> >>>>>>>> >
>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
>>>>>> >>>>>>>> >
>>>>>> >>>>>>>> > Operational lineage collection is a common need to
>>>>>> understand dependencies between data pipelines and track end-to-end
>>>>>> provenance of data. It enables many use cases from ensuring reliable
>>>>>> delivery of data through observability to compliance and cost management.
>>>>>> >>>>>>>> >
>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow
>>>>>> capability to enable troubleshooting and governance.
>>>>>> >>>>>>>> >
>>>>>> >>>>>>>> > OpenLineage is a project that is part of the LFAI&Data
>>>>>> foundation and provides a spec standardizing operational lineage
>>>>>> collection and sharing across the data ecosystem. Although it provides
>>>>>> plugins for popular open source projects, its intent is very similar to
>>>>>> OpenTelemetry (also under the Linux Foundation umbrella): to remain a spec
>>>>>> for lineage exchange that projects - open source or proprietary -
>>>>>> implement.
>>>>>> >>>>>>>> >
>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it
>>>>>> easier and more reliable for Airflow users to publish their operational
>>>>>> lineage through the OpenLineage ecosystem.
>>>>>> >>>>>>>> >
>>>>>> >>>>>>>> > The current external plugin maintained in the OpenLineage
>>>>>> project depends on Airflow and operator internals and breaks when
>>>>>> those change. A built-in integration ensures better first-class support
>>>>>> for exposing lineage, which gets tested alongside other changes and is
>>>>>> therefore more stable.
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>> --
>>>>>> >>>>>> Eugene
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> --
>>>>>> >>>>> Eugene
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Eugene
>>>>>>
>>>>>
