+1, would be happy to join the session! (Please add either
ikholo...@google.com or kholopo...@gmail.com).

Best,
Igor

On Wed, Mar 22, 2023 at 11:27 PM Pierre Jeambrun <pierrejb...@gmail.com>
wrote:

> Same here if you can add me please.
>
> Looking forward to this session.
>
> On Wed, Mar 22, 2023 at 23:07, Mehta, Shubham <shu...@amazon.com.invalid>
> wrote:
>
> > Please include me, I will try my best to join
> > (shubhammehta...@gmail.com)
> >
> > Best,
> > Shubham
> >
> > On 2023-03-22, 2:24 PM, "Jarek Potiuk" <ja...@potiuk.com> wrote:
> >
> > There are some strange behaviours in the calendar entry - I think you
> > cannot add yourself, only guests can add others :)
> > I've added you, Eugen. If anyone else wants to be added, please
> > post here with your gmail/calendar address.
> >
> >
> > J.
> >
> >
> > On Wed, Mar 22, 2023 at 9:56 PM Eugen Kosteev <eu...@kosteev.com> wrote:
> > >
> > > Hi Julien.
> > >
> > > Could you please include me there as well: eu...@kosteev.com or
> > > kost...@google.com.
> > > Looking forward to seeing the presentation.
> > >
> > > - Eugene
> > >
> > > On Wed, Mar 22, 2023 at 8:36 PM Julien Le Dem
> > > <jul...@astronomer.io.invalid> wrote:
> > >
> > > > Hello all,
> > > > I have to move the OpenLineage presentation to next week.
> > > > Sorry for the change.
> > > > It will be on Friday next week, March 31st, at 5pm CET / 9am PT.
> > > >
> > > >
> > > > https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
> > > > Julien
> > > >
> > > > On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <jul...@astronomer.io>
> > > > wrote:
> > > >
> > > > > We are planning to do this session next Thursday at 5pm CET / 9am PT.
> > > > > I will send a zoom link in advance.
> > > > > Julien
> > > > >
> > > > > On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <ja...@potiuk.com> wrote:
> > > > >
> > > > >> Cool. I am looking forward to it :). It would be great to get some
> > > > >> insight from those who attempted to get the lineage working in several
> > > > >> versions of Open Lineage and finally arrived at the current
> > > > >> specs/integration.
> > > > >>
> > > > >> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
> > > > >> <jul...@astronomer.io.invalid> wrote:
> > > > >> >
> > > > >> > Thank you Jarek,
> > > > >> > I am happy to organize a zoom presentation about OpenLineage and answer
> > > > >> > any questions. It is indeed a spec decoupling the data transformation
> > > > >> > layer from the metadata store people are using, just as OpenTelemetry
> > > > >> > does for service metrics/traces.
> > > > >> > Best,
> > > > >> > Julien
> > > > >> >
> > > > >> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > > > >> >>
> > > > >> >> And to add a little "parallel" - I think the Open Lineage integration
> > > > >> >> replacing our "generic lineage" is a very similar step to the new
> > > > >> >> "Multi-tenant"-ready authentication interface we are discussing in
> > > > >> >> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
> > > > >> >>
> > > > >> >> Yes - we have a generic authentication interface, but no - it's
> > > > >> >> useless for the case where multi-tenancy and a good level of resource
> > > > >> >> authorization are needed. It's just far too simplistic and limited.
> > > > >> >>
> > > > >> >> Same with the current generic lineage interface - yes, we have it, but
> > > > >> >> it's only useful in a limited set of cases, and if we want to step it up
> > > > >> >> we need to come up with something better (and Open Lineage happens to be
> > > > >> >> one that has been developed with Airflow in mind and battle-tested).
> > > > >> >>
> > > > >> >> J.
> > > > >> >>
> > > > >> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> > > > >> >>>
> > > > >> >>> Hey Rafał (Eugene, Michal - and others who are looking),
> > > > >> >>>
> > > > >> >>> I think I know where your/Eugen's/Michał's concerns are coming from, and
> > > > >> >>> I think it would be great if we could talk it over a bit. I believe this
> > > > >> >>> is - in part - quite a misunderstanding of what Open Lineage really is,
> > > > >> >>> how much of an integration it is, and why it has been implemented the
> > > > >> >>> way it was implemented in Airflow.
> > > > >> >>>
> > > > >> >>> **Idea**: (Julien - Maybe you can organize it ?):
> > > > >> >>>
> > > > >> >>> Maybe we can have an open-to-everyone presentation/zoom call, with
> > > > >> >>> plenty of time set aside for questions, where you would explain those
> > > > >> >>> integration points to the community (and especially to those people who
> > > > >> >>> are worried we are losing something by choosing the OpenLineage
> > > > >> >>> integration). I would love to see such a presentation - specifically
> > > > >> >>> focused on explaining how Open Lineage really improves the current
> > > > >> >>> lineage approach and what problems it solves that the existing generic
> > > > >> >>> interface doesn't.
> > > > >> >>>
> > > > >> >>> Just to set the tone and focus for such meeting if we have
> one:
> > > > >> >>>
> > > > >> >>> For me - when I look at Open Lineage, it is really "this is how the
> > > > >> >>> generic lineage interface **should** be done in Airflow". The "generic"
> > > > >> >>> lineage support we have now is very, very basic, I'd even say far too
> > > > >> >>> simplistic. I would even say it's useless beyond a few very basic use
> > > > >> >>> cases, simply because there was never a good "receiver" of the
> > > > >> >>> information to cover those cases.
> > > > >> >>>
> > > > >> >>> When you look closely at OpenLineage, it's nothing more than a better
> > > > >> >>> convention for the dictionaries that we send as metadata, plus better
> > > > >> >>> metadata in the case of SQL operators (and hopefully Hooks in the
> > > > >> >>> future), which allows handling some cases that the current lineage
> > > > >> >>> simply cannot. The Open Lineage integration with Airflow also covers
> > > > >> >>> better handling of the "task" and "dag" lifecycle in Airflow, to be able
> > > > >> >>> to bind lineage data together. That's my understanding of what we get
> > > > >> >>> when we integrate OL.
> > > > >> >>>
> > > > >> >>> I think over the last 2 years the Datakin/Astronomer people have worked
> > > > >> >>> out the level of interface that **just works**, and if we wanted the
> > > > >> >>> lineage information from Airflow to be as useful as it is in OL, we
> > > > >> >>> would have to implement pretty much all of the things they already did
> > > > >> >>> anyway.
> > > > >> >>>
> > > > >> >>> I (and I think many community members) would love to take part in
> > > > >> >>> such a call to hear about that particular aspect of the OL integration.
> > > > >> >>>
> > > > >> >>> J.
> > > > >> >>>
> > > > >> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz
> > > > >> >>> <rafalbieg...@google.com.invalid> wrote:
> > > > >> >>>>
> > > > >> >>>> Hi,
> > > > >> >>>>
> > > > >> >>>> I second/echo the input provided by Eugene and Michal.
> > > > >> >>>>
> > > > >> >>>> In general, Airflow should provide generic interfaces to lineage
> > > > >> >>>> backends so it's easy to configure the one preferred by the user.
> > > > >> >>>> Whether it's Open Lineage, a proprietary solution, Dataplex Lineage,
> > > > >> >>>> etc., it should be the user's choice.
> > > > >> >>>>
> > > > >> >>>> We should avoid close integration with any specific lineage backend
> > > > >> >>>> for the reasons already mentioned, i.e. to avoid translations between
> > > > >> >>>> lineage backends. Also, we would closely couple one framework (Airflow)
> > > > >> >>>> with another one (Open Lineage), which makes Airflow more complex and
> > > > >> >>>> less flexible. Loose coupling between lineage backends and Airflow
> > > > >> >>>> seems to be more future-proof.
> > > > >> >>>>
> > > > >> >>>> Regards, Rafal.
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
> > > > >> >>>> <jul...@astronomer.io.invalid> wrote:
> > > > >> >>>>>
> > > > >> >>>>> Dear Airflow community,
> > > > >> >>>>> I have transferred the content of the working google doc I shared a
> > > > >> >>>>> few weeks ago to the Airflow confluence:
> > > > >> >>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > > > >> >>>>> All comments have been answered, I added clarifications to the doc
> > > > >> >>>>> accordingly and I also added your suggestions to improve the proposal.
> > > > >> >>>>> All that history is linked from the discussion thread link in the
> > > > >> >>>>> confluence doc if you wish to consult it.
> > > > >> >>>>> Thank you all for your feedback and help in the process.
> > > > >> >>>>> Best
> > > > >> >>>>> Julien
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <jul...@astronomer.io>
> > > > >> >>>>> wrote:
> > > > >> >>>>>>
> > > > >> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
> > > > >> >>>>>> I do agree with Jarek's assessment. I don't have very much to add
> > > > >> >>>>>> to his argument, it is very thoughtful!
> > > > >> >>>>>> OpenLineage was started to avoid the cartesian complexity that
> > > > >> >>>>>> Eugene mentions. There's actually that specific illustration in the
> > > > >> >>>>>> OpenLineage doc.
> > > > >> >>>>>> Lineage consumers want to avoid having to understand the lineage
> > > > >> >>>>>> format of each individual observed data transformation layer. And
> > > > >> >>>>>> transformation layers don't want to understand every metadata store's
> > > > >> >>>>>> model and protocol.
> > > > >> >>>>>> Eugene, regarding your specific proposal of a global vocabulary of
> > > > >> >>>>>> entities, I think it is a great suggestion.
> > > > >> >>>>>> We can map those entities to Datasets in OpenLineage. The way
> > > > >> >>>>>> OpenLineage models this is by allowing specific facets to be attached
> > > > >> >>>>>> to a Dataset. Facets are pieces of metadata, each with its own
> > > > >> >>>>>> JsonSchema.
> > > > >> >>>>>> For example, a table from a relational database will have a schema
> > > > >> >>>>>> facet whereas a file in GCS might not.
> > > > >> >>>>>> So I think in Airflow we could have each of the entity classes you
> > > > >> >>>>>> describe be used in the get_openlineage_facets*() API in the Operators.
> > > > >> >>>>>> Each of those classes would know what OpenLineage facets it can
> > > > >> >>>>>> expose.
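
The entity-to-Dataset mapping described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain dataclasses instead of the real OpenLineage client classes, and `Dataset`, `to_dataset`, the `columns` field and the namespace strings are all hypothetical names chosen for the example, not part of the AIP or of any library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Dataset:
    """Hypothetical stand-in for an OpenLineage dataset: identity plus optional facets."""

    namespace: str
    name: str
    facets: Dict[str, Dict[str, Any]] = field(default_factory=dict)


@dataclass
class PostgresTable:
    """Entity from the proposed vocabulary; it knows which facets it can expose."""

    host: str
    port: str
    database: str
    schema: str
    table: str
    columns: List[str] = field(default_factory=list)  # assumed extra field

    def to_dataset(self) -> Dataset:
        dataset = Dataset(
            namespace=f"postgres://{self.host}:{self.port}",
            name=f"{self.database}.{self.schema}.{self.table}",
        )
        # A relational table can describe its columns, so it attaches a schema facet.
        if self.columns:
            dataset.facets["schema"] = {"fields": [{"name": c} for c in self.columns]}
        return dataset


@dataclass
class GCSEntity:
    """Entity for a GCS object; in this sketch it exposes no facets at all."""

    bucket: str
    path: str

    def to_dataset(self) -> Dataset:
        return Dataset(namespace=f"gs://{self.bucket}", name=self.path)
```

The point of the sketch is only that the facet knowledge lives in the entity class: an operator would return entities, and whatever implements `get_openlineage_facets*()` would call something like `to_dataset()` on each of them.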
> > > > >> >>>>>> I'll add a mention in the AIP and I think we can go into more
> > > > >> >>>>>> detail in a ticket.
> > > > >> >>>>>> Cheers,
> > > > >> >>>>>> Julien
> > > > >> >>>>>>
> > > > >> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com>
> > > > >> >>>>>> wrote:
> > > > >> >>>>>>>
> > > > >> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer will
> > > > >> >>>>>>> be more thoughtful).
> > > > >> >>>>>>>
> > > > >> >>>>>>> I think you are right about the "agnostic" part. But I have one
> > > > >> >>>>>>> question - what are we considering "agnostic"?
> > > > >> >>>>>>>
> > > > >> >>>>>>> There is no "widespread" standard for lineage (yet). Open Lineage,
> > > > >> >>>>>>> with its donation to Linux Foundation Data & AI, is aspiring to become
> > > > >> >>>>>>> one. And it's a pretty good candidate:
> > > > >> >>>>>>>
> > > > >> >>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
> > > > >> >>>>>>> published as an API from day one)
> > > > >> >>>>>>> * as of recently, the ownership and governance of Open Lineage is with
> > > > >> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/), which is
> > > > >> >>>>>>> part of "Linux Foundation Project" - a well-known and respected
> > > > >> >>>>>>> foundation that, similarly to the ASF, is an umbrella and provides
> > > > >> >>>>>>> governance rules for a large number of well-established OSS projects
> > > > >> >>>>>>>
> > > > >> >>>>>>> In essence it is the same approach as we already discussed and
> > > > >> >>>>>>> approved for Open Telemetry (which is governed by CNCF, which is in the
> > > > >> >>>>>>> same league as LFP in terms of recognition and governance), though it
> > > > >> >>>>>>> is not yet implemented. In the case of Open Telemetry, we decided
> > > > >> >>>>>>> against developing our "own" standard and opted for one that is out
> > > > >> >>>>>>> there. Yes, it is a bit more established and popular than Open Lineage
> > > > >> >>>>>>> is, but I so wish that we had chosen and implemented it already (and
> > > > >> >>>>>>> earlier), as not having a standard there - except statsd, which is
> > > > >> >>>>>>> really, really poor - has a great impact on Airflow being just
> > > > >> >>>>>>> "pluggable" into existing solutions for monitoring. (BTW, I hope we
> > > > >> >>>>>>> implement it soon, and I hear (and see) there are attempts to do so.)
> > > > >> >>>>>>>
> > > > >> >>>>>>> In the case of Open Lineage, the questions are - is there an
> > > > >> >>>>>>> alternative of the same caliber? Shall we produce our own "agnostic
> > > > >> >>>>>>> standard" for it instead? Is there a chance the idea of
> > > > >> >>>>>>> "airflow-specific" attributes will catch on and many "consumers" will
> > > > >> >>>>>>> be writing their own conversions so that they can consume it?
> > > > >> >>>>>>>
> > > > >> >>>>>>> I would really, really try to avoid the pitfalls nicely summarized
> > > > >> >>>>>>> here: https://xkcd.com/927/
> > > > >> >>>>>>>
> > > > >> >>>>>>> We can of course make a wrong bet, and in 2 years Airflow might be the
> > > > >> >>>>>>> only one supporting Open Lineage. That might happen, though the list
> > > > >> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or maybe -
> > > > >> >>>>>>> more likely - once Airflow implements it, due to Airflow's popularity
> > > > >> >>>>>>> and the fact that there is already competition supporting it (e.g.
> > > > >> >>>>>>> Amundsen), we will increase the chance of "hockey-stick" adoption of
> > > > >> >>>>>>> Open Lineage. My bet is on the latter, for the benefit of the whole
> > > > >> >>>>>>> ecosystem. I think we have a chance to influence the creation of a new,
> > > > >> >>>>>>> important standard. Much less so, I think, if we just provide our own
> > > > >> >>>>>>> custom solution - with lots and lots of work for others to be able to
> > > > >> >>>>>>> consume it, and no time to properly nurture the API and make it easier
> > > > >> >>>>>>> to implement (which is undoubtedly the main focus of Datakin,
> > > > >> >>>>>>> Astronomer and now the governance run by LF Data & AI).
> > > > >> >>>>>>>
> > > > >> >>>>>>> Are there other alternatives we should consider? Do we want to
> > > > >> >>>>>>> develop our own standard (and implement all the integrations from the
> > > > >> >>>>>>> ground up)?
> > > > >> >>>>>>>
> > > > >> >>>>>>> J.
> > > > >> >>>>>>>
> > > > >> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eu...@kosteev.com>
> > > > >> >>>>>>> wrote:
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > Hi Julien.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > I reviewed the design doc.
> > > > >> >>>>>>> > The general idea looks good to me, but I have some concerns
> > > > >> >>>>>>> > that I would like to share.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > If I understand correctly, the proposed design is to fill in
> > > > >> >>>>>>> > "operators" with self-methods to extract lineage metadata from them,
> > > > >> >>>>>>> > and I agree with the motivation. If those are decoupled from the
> > > > >> >>>>>>> > operators themselves (in the form of extractors in a separate
> > > > >> >>>>>>> > package), then the downside is that (as you mentioned) extractors will
> > > > >> >>>>>>> > be distributed separately and "operators" logic is out of sync with
> > > > >> >>>>>>> > "lineage extraction" logic by design.
> > > > >> >>>>>>> > Also, knowledge about the internals of an operator spills out of the
> > > > >> >>>>>>> > operator, which is not good at all (at the very least).
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > However, if we make every operator expose a method to
> > > > >> >>>>>>> > generate lineage metadata in a specific format, e.g. OpenLineage etc.,
> > > > >> >>>>>>> > then we will end up with the cartesian complexity of supporting each
> > > > >> >>>>>>> > backend format in each provider+operator.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > If you say that the goal is that "operators" will always
> > > > >> >>>>>>> > generate the OpenLineage format only and each consumer will convert
> > > > >> >>>>>>> > this format to their own internal representation, well, if they do
> > > > >> >>>>>>> > this then this seems like a working approach - but only under the
> > > > >> >>>>>>> > assumption that each consumer will support it.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > I think it comes down to the question: is the OpenLineage format
> > > > >> >>>>>>> > popular, complete and proper enough for lineage metadata that every
> > > > >> >>>>>>> > consumer will be convinced to support it? We may also consider issues
> > > > >> >>>>>>> > like a mismatch in lineage feature parity, e.g. OpenLineage supports
> > > > >> >>>>>>> > field-level lineage but a consumer doesn't (or not at the moment), so
> > > > >> >>>>>>> > we would prefer the lineage metadata transferred to the backend to be
> > > > >> >>>>>>> > slightly different in this case.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > What do you think about the idea:
> > > > >> >>>>>>> > 1. Make lineage metadata generated by "operators" agnostic of the
> > > > >> >>>>>>> > specific format, just using entities from a big generic vocabulary of
> > > > >> >>>>>>> > entities, e.g. created here:
> > > > >> >>>>>>> > https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py
> > > > >> >>>>>>> > We would have there e.g. entities like:
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > --------------------------------------------------------------------
> > > > >> >>>>>>> > import attr
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > >> >>>>>>> > class PostgresTable:
> > > > >> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >     host: str = attr.ib()
> > > > >> >>>>>>> >     port: str = attr.ib()
> > > > >> >>>>>>> >     database: str = attr.ib()
> > > > >> >>>>>>> >     schema: str = attr.ib()
> > > > >> >>>>>>> >     table: str = attr.ib()
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > >> >>>>>>> > class GCSEntity:
> > > > >> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >     bucket: str = attr.ib()
> > > > >> >>>>>>> >     path: str = attr.ib()
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > >> >>>>>>> > class AWSS3Entity:
> > > > >> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >     bucket: str = attr.ib()
> > > > >> >>>>>>> >     path: str = attr.ib()
> > > > >> >>>>>>> > --------------------------------------------------------------------
> > > > >> >>>>>>> > 2. Implement "adapters" that will act as a bridge between
> > > > >> >>>>>>> > "operators" and backends. Their responsibility will be to convert
> > > > >> >>>>>>> > lineage metadata generated by "operators" to a format understandable
> > > > >> >>>>>>> > by a specific backend.
> > > > >> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets to
> > > > >> >>>>>>> > pass Airflow lineage metadata to the Airflow lineage backend.
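
To make the two-step proposal above concrete, here is a minimal sketch of what such adapters could look like. Everything in it is an assumption made for illustration: the `LineageAdapter` base class, the `convert`/`convert_all` methods, the made-up `FullyQualifiedNameAdapter` backend and the output dictionaries are hypothetical and do not reproduce the real OpenLineage or Data Lineage wire formats. Only the `OpenLineageAdapter` name comes from the message itself.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass(frozen=True)
class PostgresTable:
    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    bucket: str
    path: str


Entity = Union[PostgresTable, GCSEntity]


class LineageAdapter(ABC):
    """Hypothetical bridge from generic entities to one backend's format."""

    @abstractmethod
    def convert(self, entity: Entity) -> Dict[str, Any]:
        """Convert a single generic entity to the backend representation."""

    def convert_all(self, entities: List[Entity]) -> List[Dict[str, Any]]:
        return [self.convert(entity) for entity in entities]


class OpenLineageAdapter(LineageAdapter):
    """Emits namespace/name pairs; the dict layout is illustrative only."""

    def convert(self, entity: Entity) -> Dict[str, Any]:
        if isinstance(entity, PostgresTable):
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        if isinstance(entity, GCSEntity):
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
        raise TypeError(f"unsupported entity: {type(entity).__name__}")


class FullyQualifiedNameAdapter(LineageAdapter):
    """A second, made-up backend that wants one fully qualified name string."""

    def convert(self, entity: Entity) -> Dict[str, Any]:
        if isinstance(entity, PostgresTable):
            fqn = f"postgres:{entity.host}.{entity.database}.{entity.schema}.{entity.table}"
        elif isinstance(entity, GCSEntity):
            fqn = f"gcs:{entity.bucket}/{entity.path}"
        else:
            raise TypeError(f"unsupported entity: {type(entity).__name__}")
        return {"fully_qualified_name": fqn}
```

With this shape, operators only ever list generic entities in their inlets/outlets, and each new backend costs one adapter class rather than a change in every operator, which is the cartesian-complexity argument made earlier in this message.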
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > I didn't fully get the implementation details of your proposed
> > > > >> >>>>>>> > design, but I think maintaining a global vocabulary of entities to
> > > > >> >>>>>>> > use in inlets/outlets of operators is crucial for Airflow, as this
> > > > >> >>>>>>> > could be leveraged to build various features on top of it, like
> > > > >> >>>>>>> > displaying a lineage graph in the Airflow UI (based on XCOM) :)
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > It is important to note that if we decide to send lineage metadata
> > > > >> >>>>>>> > out of Airflow only in OpenLineage format, well, we could then have
> > > > >> >>>>>>> > only one "adapter", OpenLineageAdapter. But the "adapters" approach
> > > > >> >>>>>>> > leaves us room for adding support for others (following the
> > > > >> >>>>>>> > "pluggable" approach that Airflow is mainly known for and good at).
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > All in all:
> > > > >> >>>>>>> > - a global vocabulary of entities used across all "operators"
> > > > >> >>>>>>> > (with all the advantages mentioned above)
> > > > >> >>>>>>> > - the "adapters" approach
> > > > >> >>>>>>> > seem to me the crucial points in the design.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > What do you think about this?
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > - Eugene
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
> > > > >> >>>>>>> > <jul...@astronomer.io.invalid> wrote:
> > > > >> >>>>>>> >>
> > > > >> >>>>>>> >> Hello Michał,
> > > > >> >>>>>>> >> Thank you for your input.
> > > > >> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption
> > > > >> >>>>>>> >> about the backend being used to store lineage and is an adapter-like
> > > > >> >>>>>>> >> layer.
> > > > >> >>>>>>> >> OpenLineage exists as a spec specifically for the purpose
> > > > >> >>>>>>> >> of avoiding the problem of every lineage consumer having to
> > > > >> >>>>>>> >> understand every lineage producer.
> > > > >> >>>>>>> >> Consumers of lineage want a unified spec for consuming lineage
> > > > >> >>>>>>> >> from any data transformation layer like Airflow, Spark, Flink, SQL,
> > > > >> >>>>>>> >> Warehouses, ...
> > > > >> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently
> > > > >> >>>>>>> >> of the technology used, so does OpenLineage for lineage.
> > > > >> >>>>>>> >> Julien
> > > > >> >>>>>>> >>
> > > > >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
> > > > >> michalmod...@google.com <mailto:michalmod...@google.com>> wrote:
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> Hi everyone,
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> As Airflow already supports lineage functionality
> > through
> > > > >> pluggable lineage backends, I think OpenLineage and other lineage
> > > > systems
> > > > >> integration should follow this path. I think more 'native'
> > integration
> > > > with
> > > > >> OpenLineage (or any other lineage system) in Airflow while
> > maintaining
> > > > the
> > > > >> generic lineage backend architecture in parallel would make the
> user
> > > > >> experience less open, troublesome to maintain, and the Airflow
> > > > architecture
> > > > >> itself more constrained by a logic of a specific system.
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> I think enriching operators with a generic method
> > exposing
> > > > >> lineage metadata that could be leveraged by lineage backends
> > regardless
> > > > of
> > > > >> their implementation is a good idea which the Cloud Composer team
> > would
> > > > >> gladly contribute to. I believe the translation of the Airflow
> > metadata
> > > > >> exposed by the operators should be done by lineage backends (or
> > another
> > > > >> adapter-like layer). Tying Airflow operators' development to a
> > specific
> > > > >> lineage system like OpenLineage forces operators' contributors to
> > > > >> understand that system too, which increases both the entry costs
> and
> > > > >> maintenance costs. I see it as unnecessary coupling.
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> Best,
> > > > >> >>>>>>> >>> Michal
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
> > > > >> jul...@astronomer.io <mailto:jul...@astronomer.io>> wrote:
> > > > >> >>>>>>> >>>>
> > > > >> >>>>>>> >>>> Thank you Eugen,
> > > > >> >>>>>>> >>>> This sounds very aligned with the goals of
> OpenLineage
> > and
> > > > I
> > > > >> think this would work well.
> > > > >> >>>>>>> >>>> Here are the sections in the doc that I think address
> > your
> > > > >> points:
> > > > >> >>>>>>> >>>> - generalize lineage metadata extraction as
> > self-method in
> > > > >> each operator, using generic lineage entities
> > > > >> >>>>>>> >>>> See: OpenLineage support in providers. It describes
> how
> > > > each
> > > > >> operator exposes its lineage.
> > > > >> >>>>>>> >>>> - implement "adapter"s to convert generated metadata
> to
> > > > Data
> > > > >> Lineage format, Open Lineage format, etc.
> > > > >> >>>>>>> >>>> The goal here is each consumer turns from OpenLineage
> > > > format
> > > > >> to their own internal representation as you are suggesting.
> > > > >> >>>>>>> >>>> In the motivation section, towards the end, I link to
> > a few
> > > > >> examples of data catalogs doing just that.
> > > > >> >>>>>>> >>>>
> > > > >> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <
> > > > >> eu...@kosteev.com <mailto:eu...@kosteev.com>> wrote:
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>> ++ Michal Modras
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
> > > > >> eu...@kosteev.com <mailto:eu...@kosteev.com>> wrote:
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage with
> > > > >> Dataplex" feature which effectively means to generate lineage out
> of
> > > > >> DAG/task executions and export it to Data Lineage (Data Catalog
> > service)
> > > > >> for further analysis.
> > > > >> >>>>>>> >>>>>>
> > > > >>
> > https://cloud.google.com/composer/docs/composer-2/lineage-integration <
> > https://cloud.google.com/composer/docs/composer-2/lineage-integration>
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
> > > > >> >>>>>>> >>>>>> The current implementation uses built-in "Airflow
> > lineage
> > > > >> backend" feature and methods to extract lineage metadata on task
> > post
> > > > >> execution events.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> The general idea was to contribute this to the
> > Airflow
> > > > >> community in a form:
> > > > >> >>>>>>> >>>>>> - generalize lineage metadata extraction as
> > self-method
> > > > in
> > > > >> each operator, using generic lineage entities
> > > > >> >>>>>>> >>>>>> - implement "adapter"s to convert generated
> metadata
> > to
> > > > >> Data Lineage format, Open Lineage format, etc.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer
> would
> > mean
> > > > >> to introduce an additional layer of converting from OpenLineage
> > format
> > > > to
> > > > >> Data Lineage (Data Catalog/Dataplex) format. But this is
> definitely
> > a
> > > > >> possibility.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> > > > >> <jul...@astronomer.io.inva <mailto:jul...@astronomer.io.inva>lid>
> > wrote:
> > > > >> >>>>>>> >>>>>>>
> > > > >> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
> > > > >> >>>>>>> >>>>>>> I am responding in the comments and adding to the
> > doc
> > > > >> accordingly.
> > > > >> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
> > > > >> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
> > > > >> >>>>>>> >>>>>>> Julien
> > > > >> >>>>>>> >>>>>>>
> > > > >> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
> > > > >> ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> General comment from my side: I think Open
> Lineage
> > is
> > > > >> (and should be
> > > > >> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands
> > Airflow's
> > > > >> capabilities
> > > > >> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all
> > > > >> working on - Airflow
> > > > >> >>>>>>> >>>>>>>> as a Platform.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage
> > goes
> > > > >> the same
> > > > >> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open
> > Telemetry
> > > > >> goes, where we
> > > > >> >>>>>>> >>>>>>>> might decide to support certain standards in
> order
> > to
> > > > >> expand
> > > > >> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allows
> to
> > > > >> plug-in multiple
> > > > >> >>>>>>> >>>>>>>> external solutions that would use the standard
> API.
> > > > >> After Open-Lineage
> > > > >> >>>>>>> >>>>>>>> graduated recently to LFAI&Data foundation (I've
> > been
> > > > >> watching this
> > > > >> >>>>>>> >>>>>>>> happening from far), it is I think the perfect
> > > > candidate
> > > > >> for Airflow
> > > > >> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the
> > > > players
> > > > >> to make use
> > > > >> >>>>>>> >>>>>>>> of the extra work necessary by the community to
> > make it
> > > > >> "officially
> > > > >> >>>>>>> >>>>>>>> supported". I think we have to also get some
> > feedback
> > > > >> from the big
> > > > >> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to
> > have
> > > > >> such a
> > > > >> >>>>>>> >>>>>>>> capability, and another is to get it used in all
> > the
> > > > >> ways Airflow is
> > > > >> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users
> > (which
> > > > >> is obviously a
> > > > >> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where
> > Airflow
> > > > >> is exposed by
> > > > >> >>>>>>> >>>>>>>> others - Astronomer is obviously on-board. we see
> > some
> > > > >> warm words from
> > > > >> >>>>>>> >>>>>>>> Amazon (mentioned by Julian), I would love to
> hear
> > > > >> whether the
> > > > >> >>>>>>> >>>>>>>> Composer team at Google would be on board in
> using
> > the
> > > > >> open-lineage
> > > > >> >>>>>>> >>>>>>>> information exposed this way in their Data
> Catalog
> > (and
> > > > >> likely more)
> > > > >> >>>>>>> >>>>>>>> offering. We have Amundsen and others and
> possibly
> > > > other
> > > > >> stakeholders
> > > > >> >>>>>>> >>>>>>>> might want to say something.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved
> > in
> > > > >> implementing and
> > > > >> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned,
> > that
> > > > >> is the main
> > > > >> >>>>>>> >>>>>>>> reason why the Open Lineage community would like
> to
> > > > make
> > > > >> the
> > > > >> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart
> and
> > > > >> integrating it in
> > > > >> >>>>>>> >>>>>>>> a way that lets us plug it into our
> CI,
> > > > >> verification
> > > > >> >>>>>>> >>>>>>>> process and making some very clear expectations
> > about
> > > > >> what it means
> > > > >> >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we
> > can
> > > > >> make some
> > > > >> >>>>>>> >>>>>>>> initial investment in making it happen and
> minimise
> > > > >> on-going cost,
> > > > >> >>>>>>> >>>>>>>> while maximising the gain.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> And looking at all the above - I am super happy
> to
> > help
> > > > >> with all that
> > > > >> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate
> well,
> > even
> > > > >> if it will
> > > > >> >>>>>>> >>>>>>>> take an extra effort, especially since we will
> have
> > > > >> experts from Open
> > > > >> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open
> > Lineage
> > > > >> being the core
> > > > >> >>>>>>> >>>>>>>> part of the effort. I am actually super excited -
> > this
> > > > >> might be the
> > > > >> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its
> > position
> > > > as
> > > > >> an
> > > > >> >>>>>>> >>>>>>>> indispensable component of "even more modern data
> > > > stack".
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am
> > looking
> > > > >> forward to
> > > > >> >>>>>>> >>>>>>>> making it happen :).
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> J.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
> > > > >> >>>>>>> >>>>>>>> <jul...@astronomer.io.inva <mailto:
> > jul...@astronomer.io.inva>lid> wrote:
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Dear Airflow Community,
> > > > >> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
> > > > >> OpenLineage provider to Airflow.
> > > > >> >>>>>>> >>>>>>>> > I am looking for feedback with the goal of posting
> > an
> > > > >> official AIP.
> > > > >> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
> > > > >> >>>>>>> >>>>>>>> > Thank you,
> > > > >> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the
> > doc:
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Operational lineage collection is a common need
> > to
> > > > >> understand dependencies between data pipelines and track
> end-to-end
> > > > >> provenance of data. It enables many use cases from ensuring
> reliable
> > > > >> delivery of data through observability to compliance and cost
> > > > management.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Publishing operational lineage is a core
> Airflow
> > > > >> capability to enable troubleshooting and governance.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > OpenLineage is a project within the LFAI&Data
> > > > >> foundation that provides a spec standardizing operational lineage
> > > > >> collection and sharing across the data ecosystem. While it provides
> > plugins
> > > > >> for popular open source projects, its intent is very similar to
> > > > >> OpenTelemetry (also under the Linux Foundation umbrella): to
> remain
> > a
> > > > spec
> > > > >> for lineage exchange that projects - open source or proprietary -
> > > > implement.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will
> > make it
> > > > >> easier and more reliable for Airflow users to publish their
> > operational
> > > > >> lineage through the OpenLineage ecosystem.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > The current external plugin maintained in the
> > > > >> OpenLineage project depends on Airflow and operator internals and
> > gets
> > > > >> broken when changes are made on those. Having a built-in
> integration
> > > > >> ensures better first-class support to expose lineage that gets
> > tested
> > > > >> alongside other changes and therefore is more stable.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> --
> > > > >> >>>>>>> >>>>>> Eugene
> > > > >>
> > > > >
> > > >
> > >
> > >
> > > --
> > > Eugene
> >
> >
>
