Hi Julien.

Could you please include me as well: [email protected] or
[email protected].
Looking forward to seeing the presentation.

- Eugene

On Wed, Mar 22, 2023 at 8:36 PM Julien Le Dem <[email protected]>
wrote:

> Hello all,
> I have to move the OpenLineage presentation to next week.
> Sorry for the change.
> It will be Friday next week March 31st at 5pm CET 9am PT.
>
> https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
> Julien
>
> On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <[email protected]>
> wrote:
>
> > We are planning to do this session next Thursday at 5pm CET 9am PT. I
> will
> > send a zoom link in advance.
> > Julien
> >
> > On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <[email protected]> wrote:
> >
> >> Cool. I am looking forward to it :). It would be great to get some
> >> insight from those who attempted to get the lineage working in several
> >> versions of Open Lineage and finally arrived at the current
> >> specs/integration.
> >>
> >> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
> >> <[email protected]> wrote:
> >> >
> >> > Thank you Jarek,
> >> > I am happy to organize a zoom presentation about OpenLineage and
> answer
> >> any question. It is indeed a spec decoupling the data transformation
> layer
> >> from the Metadata store people are using. Just like OpenTelemetry is for
> >> service metrics/traces.
> >> > Best,
> >> > Julien
> >> >
> >> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <[email protected]>
> wrote:
> >> >>
> >> >> And to add a little "parallel" - I think Open Lineage integration
> >> replacing our "generic lineage" is a very similar step to the new
> >> "Multi-tenant"-ready authentication interface we are discussing in
> >> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
> >> >>
> >> >> Yes - we have a generic authentication interface, but no - it's
> >> useless for the case where multi-tenancy and good level of resource
> >> authorization is needed. It's just far too simplistic and limited.
> >> >>
> >> >> Same with current lineage generic interface - yes, we have it but
> it's
> >> only useful in a limited set of cases, and if we want to step it up we
> need
> >> to come up with something better (and Open Lineage happens to be one
> that
> >> has been developed with Airflow in mind and battle tested).
> >> >>
> >> >> J.
> >> >>
> >> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <[email protected]>
> wrote:
> >> >>>
> >> >>> Hey Rafał (Eugene, Michal - and others who are looking),
> >> >>>
> >> >>> I think I know where your, Eugene's and Michał's concerns are coming from. And
> >> I think it would be great if we can talk it over a bit.  I believe this
> is
> >> - in parts - quite a misunderstanding of what Open Lineage really is,
> how
> >> much of an integration it is and what are the reasons why it has been
> >> implemented the way it was implemented in Airflow.
> >> >>>
> >> >>> **Idea**: (Julien -  Maybe you can organize it ?):
> >> >>>
> >> >>> Maybe we can have an open-to-everyone presentation/zoom call with
> >> plenty of time set aside for questions, where you would explain those
> >> integration points to the community (and especially to those people
> who
> >> are worried we are losing something by choosing the OpenLineage
> >> integration). I would love to see such a presentation - specifically
> >> focused on explaining how Open-Lineage is really improving the current
> >> lineage approach and what problems it solves that the existing generic
> >> interface doesn't.
> >> >>>
> >> >>> Just to set the tone and focus for such meeting if we have one:
> >> >>>
> >> >>> For me - when I look at Open Lineage, it is really "this is how
> >> lineage generic interface **should** be done in Airflow". The "generic"
> >> lineage support we have now is very, very basic, I'd even say far too
> >> simplistic. I would even say, it's useless besides a few, very basic use
> >> cases. Simply because there was never a good "receiver" of the
> information
> >> to cover those cases.
> >> >>>
> >> >>> When you look closely at OpenLineage, it's nothing more than a
> better
> >> convention of the dictionaries that we send as a metadata, better
> meta-data
> >> in case of SQL operators (Hooks in the future hopefully), allowing
> handling
> >> some cases that current lineage simply cannot. The Open Lineage
> >> integration with Airflow also covers better handling of the "task"
> and
> >> "dag" lifecycle in Airflow, so that lineage data can be bound together. That's my
> >> understanding of what we get when we integrate OL in.
> >> >>>
> >> >>> I think over the last 2 years Datakin/Astronomer people have worked
> >> out the level of interface that **just works**, and if we would like to
> get
> >> lineage information from Airflow that is as useful as it is in OL, we would
> >> have to implement pretty much all of the things they already did anyway.
> >> >>>
> >> >>> I (and I think many community members) would love to take part in
> >> such a call to hear about that particular aspect of the OL integration.
> >> >>>
> >> >>> J.
> >> >>>
> >> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <
> >> [email protected]> wrote:
> >> >>>>
> >> >>>> Hi,
> >> >>>>
> >> >>>> I second/echo the input provided by Eugene and Michal.
> >> >>>>
> >> >>>> In general, Airflow should provide generic interfaces to lineage
> >> backends so it's easy to configure the one preferred by the user.
> Whether
> >> it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it
> should
> >> be the user's choice.
> >> >>>>
> >> >>>> We should avoid close integration with any specific lineage backend
> >> due to the reasons already mentioned, i.e. to avoid translations between
> >> lineage backends. Also, we would closely couple one framework (Airflow)
> >> with another one (Open Lineage) - it makes Airflow more complex and less
> >> flexible. Loose coupling between lineage backends and Airflow seems to
> be
> >> more future-proof.
> >> >>>>
> >> >>>> Regards, Rafal.
> >> >>>>
> >> >>>>
> >> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
> >> <[email protected]> wrote:
> >> >>>>>
> >> >>>>> Dear Airflow community,
> >> >>>>> I have transferred the content of the working google doc I shared
> a
> >> few weeks ago to the Airflow confluence:
> >> >>>>>
> >>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> >> >>>>> All comments have been answered, I added clarifications to the doc
> >> accordingly and I also added your suggestions to improve the proposal.
> >> >>>>> All that history is linked from the discussion thread link in the
> >> confluence doc if you wish to consult it.
> >> >>>>> Thank you all for your feedback and help in the process.
> >> >>>>> Best
> >> >>>>> Julien
> >> >>>>>
> >> >>>>>
> >> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <
> [email protected]>
> >> wrote:
> >> >>>>>>
> >> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
> >> >>>>>> I do agree with Jarek's assessment. I don't have very much to add
> >> to his argument, it is very thoughtful!
> >> >>>>>> OpenLineage was started to avoid the cartesian complexity that
> >> Eugene mentions. There's actually that specific illustration in the
> >> OpenLineage doc.
> >> >>>>>> Lineage consumers want to avoid having to understand the lineage
> >> format of each individual observed data transformation layer. And
> >> transformation layers don't want to understand every Metadata store's
> model
> >> and protocol.
> >> >>>>>> Eugene, about your specific proposal about a global vocabulary of
> >> entities, I think it is a great suggestion.
> >> >>>>>> We can map those entities to Datasets in OpenLineage. The way
> >> OpenLineage models this is by allowing specific facets attached to
> Dataset.
> >> Facets are pieces of metadata each with their own JsonSchema.
> >> >>>>>> For example a table from a relational database will have a schema
> >> facet when a file in GCS might not.
> >> >>>>>> So I think in Airflow we could have each of the entity classes
> you
> >> describe be used in the get_openlineage_facets*() API in the Operators.
> >> >>>>>> Each of those classes would know what OpenLineage facets they can
> >> expose.
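A minimal sketch of the idea above, where each entity class knows which facets it can expose. Everything here is an assumption made for illustration: the Dataset class is a simplified stand-in rather than the OpenLineage client model, to_dataset() and the columns field are hypothetical names, plain dataclasses are used to keep the example dependency-free, and the facet dictionary only loosely imitates a schema facet.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    """Simplified stand-in for a lineage dataset (hypothetical shape)."""

    namespace: str
    name: str
    facets: Dict[str, dict] = field(default_factory=dict)


@dataclass
class PostgresTable:
    """Relational table entity; it can expose a schema facet."""

    host: str
    port: str
    database: str
    schema: str
    table: str
    columns: List[str] = field(default_factory=list)  # hypothetical extra field

    def to_dataset(self) -> Dataset:
        facets: Dict[str, dict] = {}
        # Only attach a schema facet when column names are known.
        if self.columns:
            facets["schema"] = {"fields": [{"name": c} for c in self.columns]}
        return Dataset(
            namespace=f"postgres://{self.host}:{self.port}",
            name=f"{self.database}.{self.schema}.{self.table}",
            facets=facets,
        )


@dataclass
class GCSEntity:
    """Generic GCS object entity; it exposes no schema facet here."""

    bucket: str
    path: str

    def to_dataset(self) -> Dataset:
        return Dataset(namespace=f"gs://{self.bucket}", name=self.path)
```

In this sketch a table with known columns yields a dataset carrying a "schema" entry, while a GCS object yields a dataset with an empty facets dictionary, which mirrors the table-versus-file example in the paragraph above.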
> >> >>>>>> I'll add a mention in the AIP and I think we can go in more
> >> details in a ticket.
> >> >>>>>> Cheers,
> >> >>>>>> Julien
> >> >>>>>>
> >> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <[email protected]>
> >> wrote:
> >> >>>>>>>
> >> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer
> >> will
> >> >>>>>>> be more thoughtful).
> >> >>>>>>>
> >> >>>>>>> I think you are right about the "agnostic" part. But I have one
> >> question
> >> >>>>>>> - what are we considering "agnostic"?
> >> >>>>>>>
> >> >>>>>>>  There is no "widespread" standard for lineage (yet). Open
> Lineage
> >> >>>>>>> with its donation to Linux Foundation Data & AI is aspiring to
> >> become
> >> >>>>>>> one. And it's a pretty good candidate:
> >> >>>>>>>
> >> >>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
> >> >>>>>>> published as an API from day one)
> >> >>>>>>> * as of recently, the ownership and governance of Open Lineage
> is
> >> with
> >> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/)
> which
> >> is
> >> >>>>>>> part of "Linux Foundation Project" - a well-known and respected
> >> >>>>>>> foundation that - similarly to the ASF - is an umbrella and
> provides
> >> >>>>>>> governance rules for a big number of well established OSS
> projects
> >> >>>>>>>
> >> >>>>>>> In essence it is the same approach as we already discussed and
> >> >>>>>>> approved for Open Telemetry (which is governed by CNCF which is
> >> in the
> >> >>>>>>> same league as LFP in recognition and governance) (not yet
> >> implemented
> >> >>>>>>> though). In the case of Open-Telemetry, we decided against
> >> developing
> >> >>>>>>> our "own" existing standard but we opted for one that is out
> >> there.
> >> >>>>>>> Yes it is a bit more established and popular than Open Lineage
> >> is, but
> >> >>>>>>> I so wish that we had chosen and implemented it already (and earlier
> >> as not
> >> >>>>>>> having a standard there - except statsd which is really, really
> >> poor)
> >> >>>>>>> has a great impact on Airflow being just "pluggable" in existing
> >> >>>>>>> solutions for monitoring. (BTW. I hope we implement it soon and
> I
> >> hear
> >> >>>>>>> (and see) there are attempts to do so).
> >> >>>>>>>
> >> >>>>>>> In the case of Open Lineage, the questions are - is there an
> >> >>>>>>> alternative of the same caliber? Shall we produce our own
> >> "agnostic
> >> >>>>>>> standard" for it instead ? Is there a chance the idea of
> >> >>>>>>> "airflow-specific" attributes will catch on and many "consumers"
> >> will
> >> >>>>>>> be writing their own conversions to the way they can consume it?
> >> >>>>>>>
> >> >>>>>>> I would really, really try to avoid the pitfalls nicely
> summarized
> >> >>>>>>> here: https://xkcd.com/927/
> >> >>>>>>>
> >> >>>>>>> We can of course make a wrong bet and in 2 years Airflow might
> be
> >> the
> >> >>>>>>> only one supporting Open Lineage. That might happen. Though the
> >> list
> >> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or
> >> maybe -
> >> >>>>>>> more likely - once Airflow implements it, due to Airflow's
> >> popularity
> >> >>>>>>> and the fact that there is already competition supporting it
> (e.g.
> >> >>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption
> >> of
> >> >>>>>>> Open Lineage. My bet is -  the latter and for the benefit of the
> >> whole
> >> >>>>>>> ecosystem. I think we have a chance to influence creation of a
> >> new,
> >> >>>>>>> important standard. Much less so, I think if we just provide our
> >> own
> >> >>>>>>> custom solution - with lots and lots of work for others to be
> >> able to
> >> >>>>>>> consume it, no time to properly nurture the API and make it
> >> easier to
> >> >>>>>>> implement it (which is undoubtedly the main focus of Datakin, Astronomer and
> >> now
> >> >>>>>>> the governance run by LFData & AI).
> >> >>>>>>>
> >> >>>>>>> Are there other alternatives we should consider ? Do we want to
> >> >>>>>>> develop our own standard (and implement all the integrations
> from
> >> the
> >> >>>>>>> ground up)?
> >> >>>>>>>
> >> >>>>>>> J.
> >> >>>>>>>
> >> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <
> [email protected]>
> >> wrote:
> >> >>>>>>> >
> >> >>>>>>> > Hi Julien.
> >> >>>>>>> >
> >> >>>>>>> > I reviewed the design doc.
> >> >>>>>>> > The general idea looks good to me, but I have some concerns
> >> that I would like to share.
> >> >>>>>>> >
> >> >>>>>>> > If I understand correctly, the proposed design is to give
> >> "operators" their own methods to extract lineage metadata, and I
> >> agree with the motivation. If those are decoupled (in the form of
> extractors
> >> in a separate package) from the operators themselves, then the downside is that
> (as
> >> you mentioned) extractors will be distributed separately and
> "operators"
> >> logic is out of sync with "lineage extraction" logic by design.
> >> >>>>>>> > Also, knowledge about the internals of an operator spills out of the
> >> operator, which is not good at all (at the very least).
> >> >>>>>>> >
> >> >>>>>>> > However, if we make every operator expose a method to
> >> generate lineage metadata in a specific format, e.g. OpenLineage etc.,
> >> then we will end up with the cartesian complexity of supporting each
> >> backend format in each provider+operator.
> >> >>>>>>> >
> >> >>>>>>> > If you say that the goal is that "operators" will always
> >> generate OpenLineage format only and each consumer will convert this
> format
> >> to their own internal representation, well, if they do this then this
> seems
> >> like a working approach. But with the assumption that each consumer will
> >> support it.
> >> >>>>>>> >
> >> >>>>>>> > I think it comes down to the question: is the OpenLineage format
> >> popular, complete and suitable enough for lineage metadata that every
> >> consumer will be convinced to support it? We may also consider issues
> like
> >> a mismatch in lineage feature parity, e.g. OpenLineage supports
> field-level
> >> lineage but a consumer doesn't support it (or not at the moment), so we would
> >> prefer lineage metadata transferred to the backend to be slightly
> different
> >> in this case.
> >> >>>>>>> >
> >> >>>>>>> > What do you think about the idea:
> >> >>>>>>> > 1. make lineage metadata generated by "operators" to be
> >> agnostic of the specific format, just using entities from big generic
> >> vocabulary of entities e.g. created here
> >> https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py
> .
> >> We would have there e.g. entities like:
> >> >>>>>>> > --------------------------------------------------------------------
> >> >>>>>>> > import attr  # third-party "attrs" package
> >> >>>>>>> >
> >> >>>>>>> >
> >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >> >>>>>>> > class PostgresTable:
> >> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
> >> >>>>>>> >
> >> >>>>>>> >     host: str = attr.ib()
> >> >>>>>>> >     port: str = attr.ib()
> >> >>>>>>> >     database: str = attr.ib()
> >> >>>>>>> >     schema: str = attr.ib()
> >> >>>>>>> >     table: str = attr.ib()
> >> >>>>>>> >
> >> >>>>>>> >
> >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >> >>>>>>> > class GCSEntity:
> >> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
> >> >>>>>>> >
> >> >>>>>>> >     bucket: str = attr.ib()
> >> >>>>>>> >     path: str = attr.ib()
> >> >>>>>>> >
> >> >>>>>>> >
> >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >> >>>>>>> > class AWSS3Entity:
> >> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
> >> >>>>>>> >
> >> >>>>>>> >     bucket: str = attr.ib()
> >> >>>>>>> >     path: str = attr.ib()
> >> >>>>>>> > --------------------------------------------------------------------
> >> >>>>>>> > 2. Implement "adapters" that will act as a bridge between
> >> "operators" and backends. Their responsibility will be to convert
> lineage
> >> metadata generated by "operators" to a format understandable by specific
> >> backend.
> >> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets
> to
> >> pass Airflow lineage metadata to the Airflow lineage backend.
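To make the "adapters" idea in point 2 concrete, here is a minimal sketch under stated assumptions. Plain dataclasses stand in for the attrs entities so the example has no third-party dependency. The names LineageAdapter, OpenLineageStyleAdapter and convert_all are hypothetical and are not existing Airflow or OpenLineage APIs. The namespace/name layout is only loosely modelled on OpenLineage dataset naming and should be treated as illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol, Union


@dataclass(frozen=True)
class PostgresTable:
    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    bucket: str
    path: str


# Generic vocabulary that operators would put in inlets/outlets.
Entity = Union[PostgresTable, GCSEntity]


class LineageAdapter(Protocol):
    """Hypothetical bridge between generic entities and one backend format."""

    def convert(self, entity: Entity) -> Dict[str, str]: ...


class OpenLineageStyleAdapter:
    """Hypothetical adapter emitting namespace/name dataset dictionaries."""

    def convert(self, entity: Entity) -> Dict[str, str]:
        if isinstance(entity, PostgresTable):
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        if isinstance(entity, GCSEntity):
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
        raise TypeError(f"unsupported entity: {type(entity).__name__}")


def convert_all(adapter: LineageAdapter, entities: Iterable[Entity]) -> List[Dict[str, str]]:
    """Convert every inlet/outlet entity with the configured adapter."""
    return [adapter.convert(e) for e in entities]
```

With this shape, operators only ever produce generic entities, and supporting another backend means adding another adapter class rather than touching each operator.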
> >> >>>>>>> >
> >> >>>>>>> > I didn't fully grasp the implementation details of your proposed
> >> design, but I think maintaining a global vocabulary of entities to use in
> >> inlets/outlets of operators is crucial for Airflow, as this could be
> >> leveraged to build various features on top of it, like displaying
> lineage
> >> graph in Airflow UI (based on XCOM):)
> >> >>>>>>> >
> >> >>>>>>> > It is important to note that if we decide to send lineage metadata out of Airflow
> >> only in OpenLineage format, well, we could then have
> only
> >> one "adapter", OpenLineageAdapter. But the "adapters" approach leaves us
> >> room for adding support for others (following the "pluggable" approach that
> >> Airflow is mainly known for and good at).
> >> >>>>>>> >
> >> >>>>>>> > All in all:
> >> >>>>>>> > - global vocabulary of entities used across all "operators"
> >> (with all advantages out of it, mentioned above)
> >> >>>>>>> > - "adapters" approach
> >> >>>>>>> > seem to me to be crucial points in the design, and they make sense to
> me.
> >> >>>>>>> >
> >> >>>>>>> > What do you think about this?
> >> >>>>>>> >
> >> >>>>>>> > - Eugene
> >> >>>>>>> >
> >> >>>>>>> >
> >> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
> >> <[email protected]> wrote:
> >> >>>>>>> >>
> >> >>>>>>> >> Hello Michał,
> >> >>>>>>> >> Thank you for your input.
> >> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption
> >> about the backend being used to store lineage and is an adapter-like
> layer.
> >> >>>>>>> >> OpenLineage exists as the spec specifically for that purpose
> >> of avoiding the problem of every lineage consumer having to understand
> >> every lineage producer.
> >> >>>>>>> >> Consumers of lineage want a unified spec consuming lineage
> >> from any data transformation layer like Airflow, Spark, Flink, SQL,
> >> Warehouses, ...
> >> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently
> >> of the technology used, so does OpenLineage for lineage.
> >> >>>>>>> >> Julien
> >> >>>>>>> >>
> >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
> >> [email protected]> wrote:
> >> >>>>>>> >>>
> >> >>>>>>> >>> Hi everyone,
> >> >>>>>>> >>>
> >> >>>>>>> >>> As Airflow already supports lineage functionality through
> >> pluggable lineage backends, I think OpenLineage and other lineage
> systems
> >> integration should follow this path. I think more 'native' integration
> with
> >> OpenLineage (or any other lineage system) in Airflow while maintaining
> the
> >> generic lineage backend architecture in parallel would make the user
> >> experience less open, troublesome to maintain, and the Airflow
> architecture
> >> itself more constrained by a logic of a specific system.
> >> >>>>>>> >>>
> >> >>>>>>> >>> I think enriching operators with a generic method exposing
> >> lineage metadata that could be leveraged by lineage backends regardless
> of
> >> their implementation is a good idea which the Cloud Composer team would
> >> gladly contribute to. I believe the translation of the Airflow metadata
> >> exposed by the operators should be done by lineage backends (or another
> >> adapter-like layer). Tying Airflow operators' development to a specific
> >> lineage system like OpenLineage forces operators' contributors to
> >> understand that system too, which increases both the entry costs and
> >> maintenance costs. I see it as unnecessary coupling.
> >> >>>>>>> >>>
> >> >>>>>>> >>> Best,
> >> >>>>>>> >>> Michal
> >> >>>>>>> >>>
> >> >>>>>>> >>>
> >> >>>>>>> >>>
> >> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
> >> [email protected]> wrote:
> >> >>>>>>> >>>>
> >> >>>>>>> >>>> Thank you Eugen,
> >> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and
> I
> >> think this would work well.
> >> >>>>>>> >>>> Here are the sections in the doc that I think address your
> >> points:
> >> >>>>>>> >>>> - generalize lineage metadata extraction as self-method in
> >> each operator, using generic lineage entities
> >> >>>>>>> >>>> See: OpenLineage support in providers. It describes how
> each
> >> operator exposes its lineage.
> >> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to
> Data
> >> Lineage format, Open Lineage format, etc.
> >> >>>>>>> >>>> The goal here is that each consumer converts from the OpenLineage
> format
> >> to their own internal representation as you are suggesting.
> >> >>>>>>> >>>> In the motivation section, towards the end, I link to a few
> >> examples of data catalogs doing just that.
> >> >>>>>>> >>>>
> >> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <
> >> [email protected]> wrote:
> >> >>>>>>> >>>>>
> >> >>>>>>> >>>>> ++ Michal Modras
> >> >>>>>>> >>>>>
> >> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
> >> [email protected]> wrote:
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage with
> >> Dataplex" feature which effectively means to generate lineage out of
> >> DAG/task executions and export it to Data Lineage (Data Catalog service)
> >> for further analysis.
> >> >>>>>>> >>>>>>
> >> https://cloud.google.com/composer/docs/composer-2/lineage-integration
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
> >> >>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage
> >> backend" feature and methods to extract lineage metadata on task post
> >> execution events.
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow
> >> community in a form:
> >> >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method
> in
> >> each operator, using generic lineage entities
> >> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to
> >> Data Lineage format, Open Lineage format, etc.
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean
> >> to introduce an additional layer of converting from OpenLineage format
> to
> >> Data Lineage (Data Catalog/Dataplex) format. But this is definitely a
> >> possibility.
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> >> <[email protected]> wrote:
> >> >>>>>>> >>>>>>>
> >> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
> >> >>>>>>> >>>>>>> I am responding in the comments and adding to the doc
> >> accordingly.
> >> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
> >> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
> >> >>>>>>> >>>>>>> Julien
> >> >>>>>>> >>>>>>>
> >> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
> >> [email protected]> wrote:
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is
> >> (and should be
> >> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's
> >> capabilities
> >> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all
> >> working on - Airflow
> >> >>>>>>> >>>>>>>> as a Platform.
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes
> >> the same
> >> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry
> >> goes, where we
> >> >>>>>>> >>>>>>>> might decide to support certain standards in order to
> >> expand
> >> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allow us to
> >> plug in multiple
> >> >>>>>>> >>>>>>>> external solutions that would use the standard API.
> >> After Open-Lineage
> >> >>>>>>> >>>>>>>> graduated recently to  LFAI&Data foundation (I've been
> >> watching this
> >> >>>>>>> >>>>>>>> happening from far), it is I think the perfect
> candidate
> >> for Airflow
> >> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the
> players
> >> to make use
> >> >>>>>>> >>>>>>>> of the extra work necessary by the community to make it
> >> "officially
> >> >>>>>>> >>>>>>>> supported". I think we have to also get some feedback
> >> from the big
> >> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have
> >> such a
> >> >>>>>>> >>>>>>>> capability, and another is to get it used in all the
> >> ways Airflow is
> >> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which
> >> is obviously a
> >> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow
> >> is exposed by
> >> >>>>>>> >>>>>>>> others - Astronomer is obviously on board. We see some
> >> warm words from
> >> >>>>>>> >>>>>>>> Amazon (mentioned by Julien), I would love to hear
> >> whether the
> >> >>>>>>> >>>>>>>> Composer team at Google would be on board in using the
> >> open-lineage
> >> >>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and
> >> likely more)
> >> >>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly
> other
> >> stakeholders
> >> >>>>>>> >>>>>>>> might want to say something.
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in
> >> implementing and
> >> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that
> >> is the main
> >> >>>>>>> >>>>>>>> reason why the Open Lineage community would like to
> make
> >> the
> >> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart and
> >> integrating it in
> >> >>>>>>> >>>>>>>> a way that will allow us to plug it into our CI,
> >> verification
> >> >>>>>>> >>>>>>>> process and making some very clear expectations about
> >> what it means
> >> >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can
> >> make some
> >> >>>>>>> >>>>>>>> initial investment in making it happen and minimise
> >> on-going cost,
> >> >>>>>>> >>>>>>>> while maximising the gain.
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help
> >> with all that
> >> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even
> >> if it will
> >> >>>>>>> >>>>>>>> take an extra effort, especially that we will have
> >> experts from Open
> >> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage
> >> being the core
> >> >>>>>>> >>>>>>>> part of the effort. I am actually super excited - this
> >> might be the
> >> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position
> as
> >> an
> >> >>>>>>> >>>>>>>> indispensable component of "even more modern data
> stack".
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking
> >> forward to
> >> >>>>>>> >>>>>>>> making it happen :).
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> J.
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
> >> >>>>>>> >>>>>>>> <[email protected]> wrote:
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > Dear Airflow Community,
> >> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
> >> OpenLineage provider to Airflow.
> >> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an
> >> official AIP.
> >> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
> >> >>>>>>> >>>>>>>> > Thank you,
> >> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > Operational lineage collection is a common need to
> >> understand dependencies between data pipelines and track end-to-end
> >> provenance of data. It enables many use cases from ensuring reliable
> >> delivery of data through observability to compliance and cost
> management.
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow
> >> capability to enable troubleshooting and governance.
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > OpenLineage is a project part of the LFAI&Data
> >> foundation that provides a spec standardizing operational lineage
> >> collection and sharing across the data ecosystem. While it provides plugins
> >> for popular open source projects, its intent is very similar to
> >> OpenTelemetry (also under the Linux Foundation umbrella): to remain a
> spec
> >> for lineage exchange that projects - open source or proprietary -
> implement.
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it
> >> easier and more reliable for Airflow users to publish their operational
> >> lineage through the OpenLineage ecosystem.
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > The current external plugin maintained in the
> >> OpenLineage project depends on Airflow and operators internals and gets
> >> broken when changes are made on those. Having a built-in integration
> >> ensures better first-class support for exposing lineage that gets tested
> >> alongside other changes and therefore is more stable.
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> --
> >> >>>>>>> >>>>>> Eugene
> >> >>>>>>> >>>>>
> >> >>>>>>> >>>>>
> >> >>>>>>> >>>>>
> >> >>>>>>> >>>>> --
> >> >>>>>>> >>>>> Eugene
> >> >>>>>>> >
> >> >>>>>>> >
> >> >>>>>>> >
> >> >>>>>>> > --
> >> >>>>>>> > Eugene
> >>
> >
>


-- 
Eugene
