I'd like to join as well! (mesmaco...@gmail.com) On Thu, 23 Mar 2023 at 19:23 Oliveira, Niko <oniko...@amazon.com.invalid> wrote:
> I'd like to join as well! (oliveira...@gmail.com) > > ________________________________ > From: Igor Kholopov <ikholo...@google.com.INVALID> > Sent: Wednesday, March 22, 2023 4:01:40 PM > To: dev@airflow.apache.org > Subject: RE: [EXTERNAL]Request for feedback on proposal for new > OpenLineage provider in Airflow > > CAUTION: This email originated from outside of the organization. Do not > click links or open attachments unless you can confirm the sender and know > the content is safe. > > > > +1, would be happy to join the session! (Please add either > ikholo...@google.com or kholopo...@gmail.com). > > Best, > Igor > > On Wed, Mar 22, 2023 at 11:27 PM Pierre Jeambrun <pierrejb...@gmail.com> > wrote: > > > Same here if you can add me please. > > > > Looking forward to this session. > > > > Le mer. 22 mars 2023 à 23:07, Mehta, Shubham <shu...@amazon.com.invalid> > a > > écrit : > > > > > Please include me, I will try my best to join ( > shubhammehta...@gmail.com > > ) > > > > > > Best, > > > Shubham > > > > > > On 2023-03-22, 2:24 PM, "Jarek Potiuk" <ja...@potiuk.com <mailto: > > > ja...@potiuk.com>> wrote: > > > > > > > > > CAUTION: This email originated from outside of the organization. Do not > > > click links or open attachments unless you can confirm the sender and > > know > > > the content is safe. > > > > > > > > > > > > > > > > > > > > > There are some strange behaviours in the calendar entry - I think you > > > cannot add yourself, only guests can add others :) > > > I've added you Eugen, maybe if someone wants to be also added - please > > > post here with your gmail/calendar addresses. > > > > > > > > > J. > > > > > > > > > On Wed, Mar 22, 2023 at 9:56 PM Eugen Kosteev <eu...@kosteev.com > > <mailto: > > > eu...@kosteev.com>> wrote: > > > > > > > > Hi Julien. > > > > > > > > Can you, please, include me there as well: eu...@kosteev.com > <mailto: > > > eu...@kosteev.com> or > > > > kost...@google.com <mailto:kost...@google.com>. 
> > > > Looking forward to see presentation. > > > > > > > > - Eugene > > > > > > > > On Wed, Mar 22, 2023 at 8:36 PM Julien Le Dem > > <jul...@astronomer.io.inva > > > <mailto:jul...@astronomer.io.inva>lid> > > > > wrote: > > > > > > > > > Hello all, > > > > > I have to move the OpenLineage presentation to next week. > > > > > Sorry for the change. > > > > > It will be Friday next week March 31st at 5pm CET 9am PT. > > > > > > > > > > > > > > > > https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io > > > < > > > > > > https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io > > > > > > > > > Julien > > > > > > > > > > On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem < > jul...@astronomer.io > > > <mailto:jul...@astronomer.io>> > > > > > wrote: > > > > > > > > > > > We are planning to do this session next Thursday at 5pm CET 9am > > PT. I > > > > > will > > > > > > send a zoom link in advance. > > > > > > Julien > > > > > > > > > > > > On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <ja...@potiuk.com > > > <mailto:ja...@potiuk.com>> wrote: > > > > > > > > > > > >> Cool. I am looking forward to it :). It would be great to get > some > > > > > >> insight from those who attempted to get the lineage working in > > > several > > > > > >> versions of Open Lineage and finally arrived at the current > > > > > >> specs/integration. > > > > > >> > > > > > >> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem > > > > > >> <jul...@astronomer.io.inva <mailto:jul...@astronomer.io.inva > >lid> > > > wrote: > > > > > >> > > > > > > >> > Thank you Jarek, > > > > > >> > I am happy to organize a zoom presentation about OpenLineage > and > > > > > answer > > > > > >> any question. 
It is indeed a spec decoupling the data transformation layer from the metadata store people are using, just like OpenTelemetry is for service metrics/traces.
Best,
Julien

On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote:

And to add a little "parallel" - I think the Open Lineage integration replacing our "generic lineage" is a very similar step to the new "multi-tenant"-ready authentication interface we are discussing in https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck

Yes - we have a generic authentication interface, but no - it's useless for the case where multi-tenancy and a good level of resource authorization are needed. It's just far too simplistic and limited.

Same with the current generic lineage interface - yes, we have it, but it's only useful in a limited set of cases, and if we want to step it up we need to come up with something better (and Open Lineage happens to be one that has been developed with Airflow in mind and battle tested).

J.

On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:

Hey Rafał (Eugene, Michal - and others who are looking),

I think I know where your/Eugen/Michał concerns are coming from. And I think it would be great if we can talk it over a bit.
I > believe > > > this > > > > > is > > > > > >> - in parts - quite a misunderstanding of what Open Lineage > really > > > is, > > > > > how > > > > > >> much of an integration it is and what are the reasons why it has > > > been > > > > > >> implemented the way it was implemented in Airflow. > > > > > >> >>> > > > > > >> >>> **Idea**: (Julien - Maybe you can organize it ?): > > > > > >> >>> > > > > > >> >>> Maybe we can have an open-to-everyone presentation/zoom call > > > with > > > > > >> quite some time foreseen to ask questions where you would > explain > > > the > > > > > >> community about those integration points (and especially those > > > people > > > > > who > > > > > >> are worried we are losing something by choosing the OpenLineage > > > > > >> integration). I would love to see such a presentation - > > specifically > > > > > >> focused on explaining how Open-Lineage is really improving the > > > current > > > > > >> lineage approach and what problems it solves that the existing > > > generic > > > > > >> interface doesn't. > > > > > >> >>> > > > > > >> >>> Just to set the tone and focus for such meeting if we have > > one: > > > > > >> >>> > > > > > >> >>> For me - when I look at Open Lineage, it is really "this is > > how > > > > > >> lineage generic interface **should** be done in Airflow". The > > > "generic" > > > > > >> lineage support we have now is very, very basic, I'd even say > far > > > too > > > > > >> simplistic. I would even say, it's useless besides a few, very > > > basic use > > > > > >> cases. Simply because there was never a good "receiver" of the > > > > > information > > > > > >> to cover those cases. 
> > > > > >> >>> > > > > > >> >>> When you look closely at OpenLineage, it's nothing more > than a > > > > > better > > > > > >> convention of the dictionaries that we send as a metadata, > better > > > > > meta-data > > > > > >> in case of SQL operators (Hooks in the future hopefully), > allowing > > > > > handling > > > > > >> some cases that current lineage simply cannot. Also what > > > open-lineage > > > > > >> integration with Airflow covers better handling of the lifecycle > > > "task" > > > > > and > > > > > >> "dag" in Airflow to be able to bind lineage data together. > That's > > my > > > > > >> understanding of what we get when we integrate OL in. > > > > > >> >>> > > > > > >> >>> I think over the last 2 years Datakin/Astronomer people had > > > worked > > > > > >> out the level of interface that **just works** and if we would > > like > > > to > > > > > get > > > > > >> the lineage information from Airflow as useful as it is in OL, > we > > > would > > > > > >> have to anyway implement pretty much all of the things they > > already > > > did. > > > > > >> >>> > > > > > >> >>> I would love (and I think many community members) to take > part > > > in > > > > > >> such a call to hear on that particular aspect of the OL > > integration. > > > > > >> >>> > > > > > >> >>> J. > > > > > >> >>> > > > > > >> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz < > > > > > >> rafalbieg...@google.com.inva <mailto: > rafalbieg...@google.com.inva > > >lid> > > > wrote: > > > > > >> >>>> > > > > > >> >>>> Hi, > > > > > >> >>>> > > > > > >> >>>> I second/echo the input provided by Eugene and Michal. > > > > > >> >>>> > > > > > >> >>>> In general, Airflow should provide generic interfaces to > > > lineage > > > > > >> backends so it's easy to configure the one preferred by the > user. > > > > > Whether > > > > > >> it's Open Lineage, proprietary solution, Dataplex Lineage, etc. > it > > > > > should > > > > > >> be the user's choice. 
> > > > > >> >>>> > > > > > >> >>>> We should avoid close integration with any specific lineage > > > backend > > > > > >> due to the reasons already mentioned, i.e. to avoid translations > > > between > > > > > >> lineage backends. Also, we would closely couple one framework > > > (Airflow) > > > > > >> with another one (Open Lineage) - it makes Airflow more complex > > and > > > less > > > > > >> flexible. Loose coupling between lineage backends and Airflow > > seems > > > to > > > > > be > > > > > >> more future-proven. > > > > > >> >>>> > > > > > >> >>>> Regards, Rafal. > > > > > >> >>>> > > > > > >> >>>> > > > > > >> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem > > > > > >> <jul...@astronomer.io.inva <mailto:jul...@astronomer.io.inva > >lid> > > > wrote: > > > > > >> >>>>> > > > > > >> >>>>> Dear Airflow community, > > > > > >> >>>>> I have transferred the content of the working google doc I > > > shared > > > > > a > > > > > >> few weeks ago to the Airflow confluence: > > > > > >> >>>>> > > > > > >> > > > > > > > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow > > > < > > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow > > > > > > > > > >> >>>>> All comments have been answered, I added clarifications to > > > the doc > > > > > >> accordingly and I also added your suggestions to improve the > > > proposal. > > > > > >> >>>>> All that history is linked from the discussion thread link > > in > > > the > > > > > >> confluence doc if you wish to consult it. > > > > > >> >>>>> Thank you all for your feedback and help in the process. 
Best
Julien

On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <jul...@astronomer.io> wrote:

Thank you for the email Jarek, and Eugene for your suggestions,
I do agree with Jarek's assessment. I don't have very much to add to his argument, it is very thoughtful!
OpenLineage was started to avoid the cartesian complexity that Eugene mentions. There's actually that specific illustration in the OpenLineage doc.
Lineage consumers want to avoid having to understand the lineage format of each individual observed data transformation layer. And transformation layers don't want to understand every metadata store's model and protocol.
Eugene, about your specific proposal for a global vocabulary of entities, I think it is a great suggestion.
We can map those entities to Datasets in OpenLineage. The way OpenLineage models this is by allowing specific facets to be attached to a Dataset. Facets are pieces of metadata, each with their own JsonSchema.
For example, a table from a relational database will have a schema facet, whereas a file in GCS might not.
So I think in Airflow we could have each of the entity classes you describe be used in the get_openlineage_facets*() API in the Operators.
Each of those classes would know what OpenLineage facets they can expose.
I'll add a mention in the AIP and I think we can go into more detail in a ticket.
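The idea of entity classes that each know which facets they can expose can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, the `openlineage_facets()` method and the single `get_openlineage_facets()` method on the toy operator are hypothetical stand-ins (the thread only mentions a `get_openlineage_facets*()` family without giving signatures), and the facet payloads are plain dictionaries rather than real OpenLineage client objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableEntity:
    """Hypothetical entity for a relational table; it can describe its columns."""

    name: str
    columns: Dict[str, str] = field(default_factory=dict)  # column name -> type

    def openlineage_facets(self) -> dict:
        # A relational table can expose a schema facet (assumed dict shape).
        fields: List[dict] = [
            {"name": col, "type": typ} for col, typ in self.columns.items()
        ]
        return {"schema": {"fields": fields}}


@dataclass
class FileEntity:
    """Hypothetical entity for a file in object storage; no schema is known."""

    bucket: str
    path: str

    def openlineage_facets(self) -> dict:
        # A plain file might have no schema facet to expose.
        return {}


class ToyCopyOperator:
    """Hypothetical operator, not an Airflow class, that reports its lineage."""

    def __init__(self, source, target):
        self.source = source
        self.target = target

    def get_openlineage_facets(self) -> dict:
        # The operator only delegates; each entity decides what it can expose.
        return {
            "inputs": [self.source.openlineage_facets()],
            "outputs": [self.target.openlineage_facets()],
        }
```

With this split the operator does not need to know anything about facet formats; adding a new kind of entity means writing one class rather than touching every operator.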
> > > > > >> >>>>>> Cheers, > > > > > >> >>>>>> Julien > > > > > >> >>>>>> > > > > > >> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk < > > > ja...@potiuk.com <mailto:ja...@potiuk.com>> > > > > > >> wrote: > > > > > >> >>>>>>> > > > > > >> >>>>>>> Just a quick personal view on it, Eugene (I bet Julian's > > > answer > > > > > >> will > > > > > >> >>>>>>> be more thoughtful). > > > > > >> >>>>>>> > > > > > >> >>>>>>> I think you are right to the "agnostic" part. But I have > > one > > > > > >> question > > > > > >> >>>>>>> - what are we considering "agnostic"? > > > > > >> >>>>>>> > > > > > >> >>>>>>> There is no "widespread" standard for lineage (yet). > Open > > > > > Lineage > > > > > >> >>>>>>> with its donation to Linux Foundation Data & AI is > > aspiring > > > to > > > > > >> become > > > > > >> >>>>>>> one. And it's a pretty good candidate: > > > > > >> >>>>>>> > > > > > >> >>>>>>> * designed from grounds-up to be agnostic (Open Lineage > > was > > > only > > > > > >> >>>>>>> published as an API from day one) > > > > > >> >>>>>>> * as of recently, the ownership and governance of Open > > > Lineage > > > > > is > > > > > >> with > > > > > >> >>>>>>> Linux Foundation Data & AI ( > https://lfaidata.foundation/ > > < > > > https://lfaidata.foundation/>) > > > > > which > > > > > >> is > > > > > >> >>>>>>> part of "Linux Foundation Project" - well known and > > > respectful > > > > > >> >>>>>>> foundation that - similarly to the ASF is an umbrella > and > > > > > provides > > > > > >> >>>>>>> governance rules for a big number of well established > OSS > > > > > projects > > > > > >> >>>>>>> > > > > > >> >>>>>>> In essence it is the same approach as we already > discussed > > > and > > > > > >> >>>>>>> approved for Open Telemetry (which is governed by CNCF > > > which is > > > > > >> in the > > > > > >> >>>>>>> same league as recognition and governance to LFP) (not > yet > > > > > >> implemented > > > > > >> >>>>>>> though). 
In the case of Open-Telemetry, we decided > against > > > > > >> developing > > > > > >> >>>>>>> our "own" existing standard but we opted for one that is > > out > > > > > >> there. > > > > > >> >>>>>>> Yes it is a bit more established and popular than Open > > > Lineage > > > > > >> is, but > > > > > >> >>>>>>> i so wish that we chose and implemented it already (and > > > earlier > > > > > >> as not > > > > > >> >>>>>>> having a standard there - except statsd which is really, > > > really > > > > > >> poor) > > > > > >> >>>>>>> has a great impact on Airflow being just "pluggable" in > > > existing > > > > > >> >>>>>>> solutions for monitoring. (BTW. I hope we implement it > > soon > > > and > > > > > I > > > > > >> hear > > > > > >> >>>>>>> (and see) there are attempts to do so). > > > > > >> >>>>>>> > > > > > >> >>>>>>> In the case of Open Lineage, the questions are - is > there > > an > > > > > >> >>>>>>> alternative of the same caliber? Shall we produce our > own > > > > > >> "agnostic > > > > > >> >>>>>>> standard" for it instead ? Is there a chance the idea of > > > > > >> >>>>>>> "airflow-specific" attributes will catch up and many > > > "consumers" > > > > > >> will > > > > > >> >>>>>>> be writing their own conversions to the way they can > > > consume it? > > > > > >> >>>>>>> > > > > > >> >>>>>>> I would really, really try to avoid the pitfalls nicely > > > > > summarized > > > > > >> >>>>>>> here: https://xkcd.com/927/ <https://xkcd.com/927/> > > > > > >> >>>>>>> > > > > > >> >>>>>>> We can of course make a wrong bet and in 2 years Airflow > > > might > > > > > be > > > > > >> the > > > > > >> >>>>>>> only one supporting Open Lineage. That might happen. > > Though > > > the > > > > > >> list > > > > > >> >>>>>>> of "consumers" of Open Lineage is already pretty good > > IMHO. 
> > > Or > > > > > >> maybe - > > > > > >> >>>>>>> more likely - once Airflow implements it, due to > Airflow's > > > > > >> popularity > > > > > >> >>>>>>> and the fact that there is already competition > supporting > > it > > > > > (e.g. > > > > > >> >>>>>>> Amundsen) we will increase the chance of "hockey-stick" > > > adoption > > > > > >> of > > > > > >> >>>>>>> Open Lineage. My bet is - the latter and for the benefit > > of > > > the > > > > > >> whole > > > > > >> >>>>>>> ecosystem. I think we have a chance to influence > creation > > > of a > > > > > >> new, > > > > > >> >>>>>>> important standard. Much less so, I think if we just > > > provide our > > > > > >> own > > > > > >> >>>>>>> custom solution - with lots and lots of work for others > to > > > be > > > > > >> able to > > > > > >> >>>>>>> consume it, no time to properly nurture the API and make > > it > > > > > >> easier to > > > > > >> >>>>>>> implement it (which is undoubtedly what Datakin, > > Astronomer > > > and > > > > > >> now > > > > > >> >>>>>>> LFData & AI run governance main focus is) > > > > > >> >>>>>>> > > > > > >> >>>>>>> Are there other alternatives we should consider ? Do we > > > want to > > > > > >> >>>>>>> develop our own standard (and implement all the > > integrations > > > > > from > > > > > >> the > > > > > >> >>>>>>> grounds up) ? > > > > > >> >>>>>>> > > > > > >> >>>>>>> J. > > > > > >> >>>>>>> > > > > > >> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev < > > > > > eu...@kosteev.com <mailto:eu...@kosteev.com>> > > > > > >> wrote: > > > > > >> >>>>>>> > > > > > > >> >>>>>>> > Hi Julien. > > > > > >> >>>>>>> > > > > > > >> >>>>>>> > I reviewed the design doc. > > > > > >> >>>>>>> > The general idea looks good to me, but I have some > > > concerns > > > > > >> that I would like to share. 

If I understand correctly, the proposed design is to give "operators" their own methods to extract lineage metadata, and I agree with the motivation. If those are decoupled from the operators themselves (in the form of extractors in a separate package), then the downside is that (as you mentioned) extractors will be distributed separately and the "operators" logic is out of sync with the "lineage extraction" logic by design.
Also, knowledge about the internals of an operator spills out of the operator, which is not good at all (at the very least).

However, if we make every operator expose a method to generate lineage metadata in a specific format, e.g. OpenLineage etc., then we will end up with the cartesian complexity of supporting each backend format in each provider+operator.

If you say that the goal is that "operators" will always generate OpenLineage format only and each consumer will convert this format to their own internal representation, well, if they do this then this seems like a working approach. But with the assumption that each consumer will support it.

I think it comes down to the question: is the OpenLineage format popular, complete and proper enough for lineage metadata that every consumer will be convinced to support it? We may also consider issues like a mismatch in lineage feature parity, e.g. OpenLineage supports field-level lineage but the consumer doesn't (or not at the moment), so we would prefer the lineage metadata transferred to the backend to be slightly different in this case.

What do you think about this idea:
1. Make the lineage metadata generated by "operators" agnostic of the specific format, just using entities from a big generic vocabulary of entities, e.g. created here: https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py . We would have there e.g. entities like:
--------------------------------------------------------------------
import attr


@attr.s(auto_attribs=True, kw_only=True)
class PostgresTable:
    """Airflow lineage entity representing Postgres table."""

    host: str = attr.ib()
    port: str = attr.ib()
    database: str = attr.ib()
    schema: str = attr.ib()
    table: str = attr.ib()


@attr.s(auto_attribs=True, kw_only=True)
class GCSEntity:
    """Airflow lineage entity representing generic Google Cloud Storage entity."""

    bucket: str = attr.ib()
    path: str = attr.ib()


@attr.s(auto_attribs=True, kw_only=True)
class AWSS3Entity:
    """Airflow lineage entity representing generic AWS S3 entity."""

    bucket: str = attr.ib()
    path: str = attr.ib()
--------------------------------------------------------------------
2. Implement "adapters" that will act as a bridge between "operators" and backends. Their responsibility will be to convert the lineage metadata generated by "operators" to a format understandable by a specific backend.
And then we can use the built-in mechanism of inlets/outlets to pass Airflow lineage metadata to the Airflow lineage backend.

I didn't exactly get the implementation details of your proposed design, but I think maintaining a global vocabulary of entities to use in the inlets/outlets of operators is crucial for Airflow, as this could be leveraged to build various features on top of it, like displaying a lineage graph in the Airflow UI (based on XCom) :)

It is important to note that if we decide to send lineage metadata out of Airflow only in OpenLineage format, well, we could then have only one "adapter", OpenLineageAdapter. But the "adapters" approach leaves us room for adding support for others (following the "pluggable" approach that Airflow is mainly known for and good at).

All in all:
- a global vocabulary of entities used across all "operators" (with all the advantages mentioned above)
- the "adapters" approach
seem to me the crucial points in the design.

What do you think about this?
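
To make the two-part proposal above more concrete, here is a minimal sketch of format-agnostic entities plus one adapter. It is not code from Airflow or OpenLineage: `LineageAdapter`, `OpenLineageAdapter.convert` and `export_lineage` are hypothetical names, standard-library dataclasses stand in for the attrs classes so that the sketch runs without dependencies, and the namespace/name dictionary shape is only loosely modelled on how OpenLineage identifies datasets, so treat it as an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class PostgresTable:
    """Format-agnostic entity: carries no knowledge of any lineage backend."""

    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    """Format-agnostic entity for an object in Google Cloud Storage."""

    bucket: str
    path: str


class LineageAdapter(Protocol):
    """Hypothetical bridge between operator entities and one backend format."""

    def convert(self, entity: object) -> dict: ...


class OpenLineageAdapter:
    """Hypothetical adapter producing dataset-like dicts (assumed shape)."""

    def convert(self, entity: object) -> dict:
        if isinstance(entity, PostgresTable):
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        if isinstance(entity, GCSEntity):
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
        raise TypeError(f"unsupported entity type: {type(entity).__name__}")


def export_lineage(
    adapter: LineageAdapter, inlets: Iterable[object], outlets: Iterable[object]
) -> dict:
    """Convert an operator's inlets/outlets using whichever adapter is plugged in."""
    return {
        "inputs": [adapter.convert(e) for e in inlets],
        "outputs": [adapter.convert(e) for e in outlets],
    }
```

Supporting another backend would then mean adding another adapter class, while operators keep declaring the same entities in their inlets and outlets.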
> > > > > >> >>>>>>> > > > > > > >> >>>>>>> > - Eugene > > > > > >> >>>>>>> > > > > > > >> >>>>>>> > > > > > > >> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem > > > > > >> <jul...@astronomer.io.inva <mailto:jul...@astronomer.io.inva > >lid> > > > wrote: > > > > > >> >>>>>>> >> > > > > > >> >>>>>>> >> Hello Michał, > > > > > >> >>>>>>> >> Thank you for your input. > > > > > >> >>>>>>> >> I would clarify that OpenLineage doesn't make any > > > assumption > > > > > >> about the backend being used to store lineage and is an > > adapter-like > > > > > layer. > > > > > >> >>>>>>> >> OpenLineage exists as the spec specifically for that > > > purpose > > > > > >> of avoiding the problem of every lineage consumer having to > > > understand > > > > > >> every lineage producer. > > > > > >> >>>>>>> >> Consumers of lineage want a unified spec consuming > > > lineage > > > > > >> from any data transformation layer like Airflow, Spark, Flink, > > SQL, > > > > > >> Warehouses, ... > > > > > >> >>>>>>> >> Just like OpenTelemetry allows consuming traces > > > independently > > > > > >> of the technology used, so does OpenLineage for lineage. > > > > > >> >>>>>>> >> Julien > > > > > >> >>>>>>> >> > > > > > >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras < > > > > > >> michalmod...@google.com <mailto:michalmod...@google.com>> > wrote: > > > > > >> >>>>>>> >>> > > > > > >> >>>>>>> >>> Hi everyone, > > > > > >> >>>>>>> >>> > > > > > >> >>>>>>> >>> As Airflow already supports lineage functionality > > > through > > > > > >> pluggable lineage backends, I think OpenLineage and other > lineage > > > > > systems > > > > > >> integration should follow this path. 
I think more 'native' > > > integration > > > > > with > > > > > >> OpenLineage (or any other lineage system) in Airflow while > > > maintaining > > > > > the > > > > > >> generic lineage backend architecture in parallel would make the > > user > > > > > >> experience less open, troublesome to maintain, and the Airflow > > > > > architecture > > > > > >> itself more constrained by a logic of a specific system. > > > > > >> >>>>>>> >>> > > > > > >> >>>>>>> >>> I think enriching operators with a generic method > > > exposing > > > > > >> lineage metadata that could be leveraged by lineage backends > > > regardless > > > > > of > > > > > >> their implementation is a good idea which the Cloud Composer > team > > > would > > > > > >> gladly contribute to. I believe the translation of the Airflow > > > metadata > > > > > >> exposed by the operators should be done by lineage backends (or > > > another > > > > > >> adapter-like layer). Tying Airflow operators' development to a > > > specific > > > > > >> lineage system like OpenLineage forces operators' contributors > to > > > > > >> understand that system too, which increases both the entry costs > > and > > > > > >> maintenance costs. I see it as unnecessary coupling. > > > > > >> >>>>>>> >>> > > > > > >> >>>>>>> >>> Best, > > > > > >> >>>>>>> >>> Michal > > > > > >> >>>>>>> >>> > > > > > >> >>>>>>> >>> > > > > > >> >>>>>>> >>> > > > > > >> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem < > > > > > >> jul...@astronomer.io <mailto:jul...@astronomer.io>> wrote: > > > > > >> >>>>>>> >>>> > > > > > >> >>>>>>> >>>> Thank you Eugen, > > > > > >> >>>>>>> >>>> This sounds very aligned with the goals of > > OpenLineage > > > and > > > > > I > > > > > >> think this would work well. 
> > > > > >> >>>>>>> >>>> Here are the sections in the doc that I think > address > > > your > > > > > >> points: > > > > > >> >>>>>>> >>>> - generalize lineage metadata extraction as > > > self-method in > > > > > >> each operator, using generic lineage entities > > > > > >> >>>>>>> >>>> See: OpenLineage support in providers. It describes > > how > > > > > each > > > > > >> operator exposes its lineage. > > > > > >> >>>>>>> >>>> - implement "adapter"s to convert generated > metadata > > to > > > > > Data > > > > > >> Lineage format, Open Lineage format, etc. > > > > > >> >>>>>>> >>>> The goal here is each consumer turns from > OpenLineage > > > > > format > > > > > >> to their own internal representation as you are suggesting. > > > > > >> >>>>>>> >>>> In the motivation section, towards the end, I link > to > > > a few > > > > > >> examples of data catalogs doing just that. > > > > > >> >>>>>>> >>>> > > > > > >> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev < > > > > > >> eu...@kosteev.com <mailto:eu...@kosteev.com>> wrote: > > > > > >> >>>>>>> >>>>> > > > > > >> >>>>>>> >>>>> ++ Michal Modras > > > > > >> >>>>>>> >>>>> > > > > > >> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev < > > > > > >> eu...@kosteev.com <mailto:eu...@kosteev.com>> wrote: > > > > > >> >>>>>>> >>>>>> > > > > > >> >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage > with > > > > > >> Dataplex" feature which effectively means to generate lineage > out > > of > > > > > >> DAG/task executions and export it to Data Lineage (Data Catalog > > > service) > > > > > >> for further analysis. > > > > > >> >>>>>>> >>>>>> > > > > > >> > > > https://cloud.google.com/composer/docs/composer-2/lineage-integration > < > > > https://cloud.google.com/composer/docs/composer-2/lineage-integration> > > > > > >> >>>>>>> >>>>>> > > > > > >> >>>>>>> >>>>>> This feature is as of now in the "Preview" state. 
> > > > > >> >>>>>>> >>>>>> The current implementation uses built-in "Airflow > > > lineage > > > > > >> backend" feature and methods to extract lineage metadata on task > > > post > > > > > >> execution events. > > > > > >> >>>>>>> >>>>>> > > > > > >> >>>>>>> >>>>>> The general idea was to contribute this to the > > > Airflow > > > > > >> community in a form: > > > > > >> >>>>>>> >>>>>> - generalize lineage metadata extraction as > > > self-method > > > > > in > > > > > >> each operator, using generic lineage entities > > > > > >> >>>>>>> >>>>>> - implement "adapter"s to convert generated > > metadata > > > to > > > > > >> Data Lineage format, Open Lineage format, etc. > > > > > >> >>>>>>> >>>>>> > > > > > >> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer > > would > > > mean > > > > > >> to introduce an additional layer of converting from OpenLineage > > > format > > > > > to > > > > > >> Data Lineage (Data Catalog/Dataplex) format. But this is > > definitely > > > a > > > > > >> possibility. > > > > > >> >>>>>>> >>>>>> > > > > > >> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem > > > > > >> <jul...@astronomer.io.inva <mailto:jul...@astronomer.io.inva > >lid> > > > wrote: > > > > > >> >>>>>>> >>>>>>> > > > > > >> >>>>>>> >>>>>>> Thank you very much for your input Jarek. > > > > > >> >>>>>>> >>>>>>> I am responding in the comments and adding to > the > > > doc > > > > > >> accordingly. > > > > > >> >>>>>>> >>>>>>> I would also love to hear from more > stakeholders. > > > > > >> >>>>>>> >>>>>>> Thanks to all who provided feedback so far. 
> > > > > >> >>>>>>> >>>>>>> Julien > > > > > >> >>>>>>> >>>>>>> > > > > > >> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk < > > > > > >> ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote: > > > > > >> >>>>>>> >>>>>>>> > > > > > >> >>>>>>> >>>>>>>> General comment from my side: I think Open > > Lineage > > > is > > > > > >> (and should be > > > > > >> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands > > > Airflow's > > > > > >> capabilities > > > > > >> >>>>>>> >>>>>>>> greatly and opens up the direction we've been > all > > > > > >> working on - Airflow > > > > > >> >>>>>>> >>>>>>>> as a Platform. > > > > > >> >>>>>>> >>>>>>>> > > > > > >> >>>>>>> >>>>>>>> I think closely integrating it with > Open-Lineage > > > goes > > > > > >> the same > > > > > >> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open > > > Telemetry > > > > > >> goes, where we > > > > > >> >>>>>>> >>>>>>>> might decide to support certain standards in > > order > > > to > > > > > >> expand > > > > > >> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and > allows > > to > > > > > >> plug-in multiple > > > > > >> >>>>>>> >>>>>>>> external solutions that would use the standard > > API. > > > > > >> After Open-Lineage > > > > > >> >>>>>>> >>>>>>>> graduated recently to LFAI&Data foundation > (I've > > > been > > > > > >> watching this > > > > > >> >>>>>>> >>>>>>>> happening from far), it is I think the perfect > > > > > candidate > > > > > >> for Airflow > > > > > >> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all > the > > > > > players > > > > > >> to make use > > > > > >> >>>>>>> >>>>>>>> of the extra work necessary by the community to > > > make it > > > > > >> "officially > > > > > >> >>>>>>> >>>>>>>> supported". 
>>> I think we also have to get some feedback from the big stakeholders in
>>> Airflow, because it is one thing to have such a capability and another
>>> to get it used in all the ways Airflow is used - not only by
>>> on-premise/self-hosted users (which is obviously a huge driving
>>> factor) but also everywhere Airflow is exposed by others. Astronomer
>>> is obviously on board. We see some warm words from Amazon (mentioned
>>> by Julien). I would love to hear whether the Composer team at Google
>>> would be on board with using the OpenLineage information exposed this
>>> way in their Data Catalog (and likely more) offering. We have Amundsen
>>> and others, and possibly other stakeholders might want to say
>>> something.
>>>
>>> There is - undoubtedly - an extra effort involved in implementing it
>>> and keeping it running smoothly (as Julien mentioned, that is the main
>>> reason why the OpenLineage community would like to make the
>>> integration part of Airflow).
>>> But by being smart and integrating it in a way that lets us plug it
>>> into our CI and verification process, and by setting very clear
>>> expectations about what it means for contributors to Airflow to get it
>>> running, we can make some initial investment in making it happen and
>>> minimise the ongoing cost while maximising the gain.
>>>
>>> And looking at all the above - I am super happy to help with all that
>>> to make this easy to "swallow" and integrate well, even if it takes
>>> extra effort, especially since we will have experts from OpenLineage
>>> who have worked with both Airflow and OpenLineage as a core part of
>>> the effort. I am actually super excited - this might be the next big
>>> thing for Airflow to strengthen its position as an indispensable
>>> component of an "even more modern data stack".
>>>
>>> I made my initial comments in the doc, and am looking forward to
>>> making it happen :).
>>>
>>> J.
>>>
>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
>>> <jul...@astronomer.io.invalid> wrote:
>>>>
>>>> Dear Airflow Community,
>>>> I have been working on a proposal to bring an OpenLineage provider to
>>>> Airflow.
>>>> I am looking for feedback with the goal of posting an official AIP.
>>>> Please feel free to comment in the doc above.
>>>> Thank you,
>>>> Julien (OpenLineage project lead)
>>>>
>>>> For convenience, here is the rationale from the doc:
>>>>
>>>> Operational lineage collection is a common need to understand
>>>> dependencies between data pipelines and track end-to-end provenance
>>>> of data. It enables many use cases, from ensuring reliable delivery
>>>> of data through observability to compliance and cost management.
>>>>
>>>> Publishing operational lineage is a core Airflow capability to enable
>>>> troubleshooting and governance.
>>>>
>>>> OpenLineage is a project of the LFAI&Data foundation that provides a
>>>> spec standardizing operational lineage collection and sharing across
>>>> the data ecosystem.
>>>> Although it provides plugins for popular open source projects, its
>>>> intent is very similar to OpenTelemetry's (also under the Linux
>>>> Foundation umbrella): to remain a spec for lineage exchange that
>>>> projects - open source or proprietary - implement.
>>>>
>>>> Built-in OpenLineage support in Airflow will make it easier and more
>>>> reliable for Airflow users to publish their operational lineage
>>>> through the OpenLineage ecosystem.
>>>>
>>>> The current external plugin maintained in the OpenLineage project
>>>> depends on Airflow and operator internals and breaks when those
>>>> change. A built-in integration ensures better first-class support for
>>>> exposing lineage, because it gets tested alongside other changes and
>>>> is therefore more stable.
>
> --
> Eugene
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org