Hi Julien.

I reviewed the design doc.
The general idea looks good to me, but I have some concerns that I would
like to share.

If I understand correctly, the proposed design is to equip "operators"
with self-methods that extract lineage metadata from them, and I agree
with the motivation. If those are decoupled from the operators themselves
(in the form of extractors in a separate package), then the downside is
that (as you mentioned) extractors are distributed separately, and the
"operators" logic is out of sync with the "lineage extraction" logic by
design. Knowledge about an operator's internals also spills out of the
operator, which is not good at all (to say the least).
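To make the coupled variant concrete, here is a rough sketch of an operator exposing its own lineage through a self-method. All names, the method signature, and the return type here are my illustrative assumptions, not the API proposed in the doc:
--------------------------------------------------------------------
```python
# Hypothetical sketch: an operator describing its own inputs/outputs in
# format-agnostic terms, so lineage extraction lives next to the operator
# logic and cannot drift out of sync with it.
from dataclasses import dataclass, field


@dataclass
class LineageInfo:
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


class ExamplePostgresToGCSOperator:
    """Illustrative stand-in for a provider operator (not a real one)."""

    def __init__(self, source_table: str, dest_bucket: str, dest_path: str):
        self.source_table = source_table
        self.dest_bucket = dest_bucket
        self.dest_path = dest_path

    def get_lineage(self) -> LineageInfo:
        # The operator knows its own internals, so it can describe its
        # inputs and outputs without knowing any backend format.
        return LineageInfo(
            inputs=[{"type": "postgres_table", "name": self.source_table}],
            outputs=[{"type": "gcs_object",
                      "name": f"{self.dest_bucket}/{self.dest_path}"}],
        )
```
--------------------------------------------------------------------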

However, if we make every operator expose a method that generates lineage
metadata in a specific format (e.g. OpenLineage), then we end up with the
Cartesian complexity of supporting every backend format in every
provider+operator.

If the goal is that "operators" will always generate only the OpenLineage
format, and each consumer will convert that format to its own internal
representation, then this seems like a working approach, but only under
the assumption that every consumer will support it.

I think it comes down to the question: is the OpenLineage format popular,
complete, and well-suited enough for lineage metadata that every consumer
will be convinced to support it? We should also consider feature-parity
mismatches, e.g. OpenLineage supports field-level lineage but a given
consumer doesn't (or not yet), in which case we might prefer the lineage
metadata transferred to the backend to look slightly different.

What do you think about this idea:
1. Make the lineage metadata generated by "operators" agnostic of any
specific format, using entities from a big, generic vocabulary like the
one started here:
https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py. We
would have entities there like, e.g.:
--------------------------------------------------------------------
import attr


@attr.s(auto_attribs=True, kw_only=True)
class PostgresTable:
    """Airflow lineage entity representing a Postgres table."""

    host: str
    port: str
    database: str
    schema: str
    table: str


@attr.s(auto_attribs=True, kw_only=True)
class GCSEntity:
    """Airflow lineage entity representing a generic Google Cloud Storage
    entity."""

    bucket: str
    path: str


@attr.s(auto_attribs=True, kw_only=True)
class AWSS3Entity:
    """Airflow lineage entity representing a generic AWS S3 entity."""

    bucket: str
    path: str
--------------------------------------------------------------------
2. Implement "adapters" that act as a bridge between "operators" and
backends. Their responsibility is to convert the lineage metadata
generated by "operators" into a format understood by a specific backend.
We can then use the built-in inlets/outlets mechanism to pass Airflow
lineage metadata to the Airflow lineage backend.
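As a rough sketch of what such an adapter could look like (the entity class mirrors the vocabulary above, and the OpenLineage-style namespace/name scheme is my assumption, not something specified in the doc):
--------------------------------------------------------------------
```python
# Hypothetical sketch of an "adapter" bridging format-agnostic lineage
# entities and one specific backend format. Entity and adapter names are
# illustrative assumptions.
from dataclasses import dataclass


@dataclass
class PostgresTable:
    host: str
    port: str
    database: str
    schema: str
    table: str


class OpenLineageAdapter:
    """Converts generic Airflow lineage entities into an
    OpenLineage-style dataset representation."""

    def convert(self, entity) -> dict:
        if isinstance(entity, PostgresTable):
            # OpenLineage identifies a dataset by a namespace plus a name.
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        raise NotImplementedError(
            f"No OpenLineage conversion for {type(entity).__name__}"
        )
```
--------------------------------------------------------------------
Adding, say, a Data Lineage adapter for another backend would then be a
matter of implementing one more such class against the same entity
vocabulary, without touching any operator.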

I didn't fully follow the implementation details of your proposed design,
but I think maintaining a global vocabulary of entities to use in
operators' inlets/outlets is crucial for Airflow, as various features
could be built on top of it, like displaying a lineage graph in the
Airflow UI (based on XCom) :)

Importantly, if we decide to send lineage metadata out of Airflow only in
the OpenLineage format, then we could have just one "adapter", an
OpenLineageAdapter. But the "adapters" approach leaves us room to add
support for others (following the "pluggable" approach Airflow is well
known for).

All in all:
- a global vocabulary of entities used across all "operators" (with all
the advantages mentioned above)
- the "adapters" approach
seem to me the crucial points of the design.

What do you think about this?

- Eugene


On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem <[email protected]>
wrote:

> Hello Michał,
> Thank you for your input.
> I would clarify that OpenLineage doesn't make any assumption about the
> backend being used to store lineage and is an adapter-like layer.
> OpenLineage exists as the spec specifically for that purpose of avoiding
> the problem of every lineage consumer having to understand every lineage
> producer.
> Consumers of lineage want a unified spec consuming lineage from any data
> transformation layer like Airflow, Spark, Flink, SQL, Warehouses, ...
> Just like OpenTelemetry allows consuming traces independently of the
> technology used, so does OpenLineage for lineage.
> Julien
>
> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> As Airflow already supports lineage functionality through pluggable
>> lineage backends, I think OpenLineage and other lineage systems integration
>> should follow this path. I think more 'native' integration with OpenLineage
>> (or any other lineage system) in Airflow while maintaining the generic
>> lineage backend architecture in parallel would make the user experience
>> less open, troublesome to maintain, and the Airflow architecture itself
>> more constrained by the logic of a specific system.
>>
>> I think enriching operators with a generic method exposing lineage
>> metadata that could be leveraged by lineage backends regardless of their
>> implementation is a good idea which the Cloud Composer team would gladly
>> contribute to. I believe the translation of the Airflow metadata exposed by
>> the operators should be done by lineage backends (or another adapter-like
>> layer). Tying Airflow operators' development to a specific lineage system
>> like OpenLineage forces operators' contributors to understand that system
>> too, which increases both the entry costs and maintenance costs. I see it
>> as unnecessary coupling.
>>
>> Best,
>> Michal
>>
>>
>>
>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <[email protected]>
>> wrote:
>>
>>> Thank you Eugen,
>>> This sounds very aligned with the goals of OpenLineage and I think this
>>> would work well.
>>> Here are the sections in the doc that I think address your points:
>>> *- generalize lineage metadata extraction as self-method in each
>>> operator, using generic lineage entities*
>>> See: OpenLineage support in providers
>>> <https://docs.google.com/document/d/1aN5i8WV2Za7XiHTtyrewZscQ-4eXs1ZNfPw58JscFEw/edit#heading=h.n53oowz38zuf>
>>> . It describes how each operator exposes its lineage.
>>> *- implement "adapter"s to convert generated metadata to Data Lineage
>>> format, Open Lineage format, etc.*
>>> The goal here is that each consumer translates from the OpenLineage
>>> format to their own internal representation, as you are suggesting.
>>> In the motivation section
>>> <https://docs.google.com/document/d/1aN5i8WV2Za7XiHTtyrewZscQ-4eXs1ZNfPw58JscFEw/edit#heading=h.8siih5lo2c33>,
>>> towards the end, I link to a few examples of data catalogs doing just that.
>>>
>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <[email protected]> wrote:
>>>
>>>> ++ Michal Modras
>>>>
>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <[email protected]>
>>>> wrote:
>>>>
>>>>> Cloud Composer recently launched the "Data lineage with Dataplex"
>>>>> feature, which effectively generates lineage from DAG/task executions
>>>>> and exports it to Data Lineage (a Data Catalog service) for further
>>>>> analysis.
>>>>> https://cloud.google.com/composer/docs/composer-2/lineage-integration
>>>>>
>>>>> This feature is currently in the "Preview" state.
>>>>> The current implementation uses the built-in "Airflow lineage
>>>>> backend" feature and methods to extract lineage metadata on task
>>>>> post-execution events.
>>>>>
>>>>> The general idea was to contribute this to the Airflow community in a
>>>>> form:
>>>>> - generalize lineage metadata extraction as self-method in each
>>>>> operator, using generic lineage entities
>>>>> - implement "adapter"s to convert generated metadata to Data Lineage
>>>>> format, Open Lineage format, etc.
>>>>>
>>>>> Adoption of "Airflow OpenLineage" for Composer would mean introducing
>>>>> an additional layer that converts from the OpenLineage format to the
>>>>> Data Lineage (Data Catalog/Dataplex) format. But this is definitely a
>>>>> possibility.
>>>>>
>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Thank you very much for your input Jarek.
>>>>>> I am responding in the comments and adding to the doc accordingly.
>>>>>> I would also love to hear from more stakeholders.
>>>>>> Thanks to all who provided feedback so far.
>>>>>> Julien
>>>>>>
>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> General comment from my side: I think Open Lineage is (and should be
>>>>>>> even more) a feature of Airflow that expands Airflow's capabilities
>>>>>>> greatly and opens up the direction we've been all working on -
>>>>>>> Airflow
>>>>>>> as a Platform.
>>>>>>>
>>>>>>> I think closely integrating with Open-Lineage goes in the same
>>>>>>> direction (also mentioned in the doc) as Open Telemetry, where we
>>>>>>> might decide to support certain standards in order to expand the
>>>>>>> capabilities of Airflow-as-a-platform and allow plugging in multiple
>>>>>>> external solutions that use the standard API. After Open-Lineage
>>>>>>> recently graduated to the LFAI&Data foundation (I've been watching
>>>>>>> this happen from afar), it is, I think, the perfect candidate for
>>>>>>> Airflow to incorporate. I hope this will help all the players make
>>>>>>> use of the extra work needed from the community to make it
>>>>>>> "officially supported". I think we also have to get feedback from
>>>>>>> the big stakeholders in Airflow - because one thing is to have such
>>>>>>> a capability, and another is to get it used in all the ways Airflow
>>>>>>> is used - not only by on-premise/self-hosted users (which is
>>>>>>> obviously a huge driving factor) but also everywhere Airflow is
>>>>>>> exposed by others. Astronomer is obviously on board, and we see some
>>>>>>> warm words from Amazon (mentioned by Julien); I would love to hear
>>>>>>> whether the Composer team at Google would be on board with using the
>>>>>>> open-lineage information exposed this way in their Data Catalog (and
>>>>>>> likely more) offerings. We have Amundsen and others, and possibly
>>>>>>> other stakeholders might want to say something.
>>>>>>>
>>>>>>>
>>>>>>> There is - undoubtedly - an extra effort involved in implementing
>>>>>>> it and keeping it running smoothly (as Julien mentioned, that is the
>>>>>>> main reason why the Open Lineage community would like to make the
>>>>>>> integration part of Airflow). But by being smart about integrating
>>>>>>> it in a way that lets us plug it into our CI and verification
>>>>>>> process, and by setting some very clear expectations about what it
>>>>>>> means for contributors to Airflow to keep it running, we can make an
>>>>>>> initial investment in making it happen and minimise the ongoing cost
>>>>>>> while maximising the gain.
>>>>>>>
>>>>>>> And looking at all the above - I am super happy to help with all
>>>>>>> that to make this easy to "swallow" and integrate well, even if it
>>>>>>> will take extra effort, especially since we will have experts from
>>>>>>> Open Lineage who have worked with both Airflow and Open Lineage as
>>>>>>> the core part of the effort. I am actually super excited - this
>>>>>>> might be the next big thing for Airflow to strengthen its position
>>>>>>> as an indispensable component of the "even more modern data stack".
>>>>>>>
>>>>>>> I made my initial comments in the doc, and am looking forward to
>>>>>>> making it happen :).
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
>>>>>>> <[email protected]> wrote:
>>>>>>> >
>>>>>>> > Dear Airflow Community,
>>>>>>> > I have been working on a proposal to bring an OpenLineage provider
>>>>>>> to Airflow.
>>>>>>> > I am looking for feedback with the goal to post an official AIP.
>>>>>>> > Please feel free to comment in the doc above.
>>>>>>> > Thank you,
>>>>>>> > Julien (OpenLineage project lead)
>>>>>>> >
>>>>>>> > For convenience, here is the rationale from the doc:
>>>>>>> >
>>>>>>> > Operational lineage collection is a common need to understand
>>>>>>> dependencies between data pipelines and track end-to-end provenance of
>>>>>>> data. It enables many use cases from ensuring reliable delivery of data
>>>>>>> through observability to compliance and cost management.
>>>>>>> >
>>>>>>> > Publishing operational lineage is a core Airflow capability to
>>>>>>> enable troubleshooting and governance.
>>>>>>> >
>>>>>>> > OpenLineage is a project under the LFAI&Data foundation that
>>>>>>> provides a spec standardizing operational lineage collection and
>>>>>>> sharing across the data ecosystem. While it provides plugins for
>>>>>>> popular open source projects, its intent is very similar to
>>>>>>> OpenTelemetry (also under the Linux Foundation umbrella): to remain
>>>>>>> a spec for lineage exchange that projects - open source or
>>>>>>> proprietary - implement.
>>>>>>> >
>>>>>>> > Built-in OpenLineage support in Airflow will make it easier and
>>>>>>> more reliable for Airflow users to publish their operational lineage
>>>>>>> through the OpenLineage ecosystem.
>>>>>>> >
>>>>>>> > The current external plugin maintained in the OpenLineage project
>>>>>>> depends on Airflow and operator internals and breaks when changes
>>>>>>> are made to those. Having a built-in integration ensures better
>>>>>>> first-class support for exposing lineage that gets tested alongside
>>>>>>> other changes and is therefore more stable.
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Eugene
>>>>>
>>>>
>>>>
>>>> --
>>>> Eugene
>>>>
>>>

-- 
Eugene
