Right. That is why I was trying to make a distinction in the PR and in this
discussion between CRUD-ing Dataset Objects/Definitions vs creating and
deleting Dataset Events from the queue. Happy to standardize on whatever
terminology to make sure things are understood and we can have a productive
conversation.

For Dataset Events - creating, reading and deleting them via the API is
IMHO not controversial (see the sketch after this list):
- For creating: this has been discussed in various places, and the
endpoint could be used to trigger dependent DAGs
- For deleting: it is easy for DAGs with multiple upstream dependencies to
go out of sync, and there is no way to recover from that without
manipulating the DB directly. See here
<https://github.com/apache/airflow/discussions/36618> and here
<https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>
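To make the event operations concrete, here is a rough sketch of what such
calls could look like against the REST API. The endpoint paths, payload
fields and auth are illustrative, not a final API proposal:

    import requests

    AIRFLOW = "http://localhost:8080/api/v1"  # illustrative base URL
    AUTH = ("admin", "admin")                 # use your deployment's auth
    URI = "s3://warehouse/raw/orders.csv"     # hypothetical dataset URI

    # Create: record a dataset event so DAGs scheduled on this dataset run
    requests.post(f"{AIRFLOW}/datasets/events",
                  json={"dataset_uri": URI}, auth=AUTH).raise_for_status()

    # Read: list the events recorded for this dataset
    events = requests.get(f"{AIRFLOW}/datasets/events",
                          params={"dataset_uri": URI}, auth=AUTH).json()

    # Delete: drop a stale queued event so a DAG with multiple upstream
    # datasets can get back in sync without touching the DB directly
    requests.delete(f"{AIRFLOW}/datasets/queuedEvents",
                    params={"dataset_uri": URI}, auth=AUTH).raise_for_status()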

For CRUD-ing Dataset Definitions via API:

> IMHO Airflow should only manage its own entities and at most it should
> emit events (dataset listeners, openlineage API) to inform others about
> state changes of things that Airflow manages, but it should not be abused
> to store "other" datasets that Airflow DAGs know nothing about.


I disagree that it is an abuse. If I, as an internal data producer, publish
a dataset that I expect internal Airflow users to use, it is not abusing
Airflow to create that dataset and make it visible in Airflow. At some point
in the near future, users will start referencing it in their DAGs - it's
just a sequencing question. We don't force Connections to be tied to a
DAG, and conceptually this is no different. It is also no different from
adding the definition as part of a DAG file and having that dataset show up
in the dataset list, without forcing it to be a task output as part of a
DAG. The only valid reason to not allow it IMHO is because datasets were
designed to be defined within a DAG file, similar to a DAG, and we don't
want to deal with the impediment I laid out.
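For reference, the in-DAG-file pattern I'm comparing against looks roughly
like this - declaring the Dataset is enough for it to appear in the dataset
list once the file is parsed (the URI is made up):

    from datetime import datetime

    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    # Declaring the Dataset here makes it visible in Airflow's dataset
    # list even before anything produces or consumes it
    orders = Dataset("s3://warehouse/raw/orders.csv")

    @dag(start_date=datetime(2024, 1, 1), schedule=[orders])
    def consume_orders():
        @task
        def load():
            ...  # runs whenever an event is recorded for `orders`

        load()

    consume_orders()

Allowing the same declaration via the API would just decouple that step
from a DAG file.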

On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
> <consta...@astronomer.io.invalid> wrote:
>
> > I also think it makes sense to allow people to create/update/delete
> > Datasets via the API and eventually the UI. Even if the dataset is not
> > initially connected to a DAG, it's nice to be able to see in one place
> > all the datasets and ML models that my DAGs can leverage. We allow
> > people to create Connections and Variables via the API and UI without
> > forcing users to use them as part of a task or DAG. This isn't any
> > different in that respect.
> >
> > > Airflow has some objects that can
> > > be created by the DAG processor (DAGs, Datasets) and others that can
> > > be created with the API/UI (Connections, Variables)
> >
> >
> A comment from my side. I think there is a big conceptual difference here
> that you yourself noticed - DAG code, via the DagProcessor, creates DAGs
> and Datasets, while the UI/API allows creating and modifying
> Connections/Variables that are then USED (but never created) by DAG code.
> This is why, while I see no fundamental security blocker with "creating"
> Datasets via API, it definitely feels out of place to be able to manage
> them via API.
>
> And following the discussion from the PR - yes, we should discuss create,
> update and delete separately, because conceptually they are NOT typical
> CRUD (which the Connection/Variable API and UI are).
> I think there is a huge difference between "updating" and "deleting"
> datasets via the API and the `UD` in CRUD:
>
> * Updating a dataset does not actually "update" its definition; it informs
> those who listen on the dataset that it has changed. No more, no less.
> Typically when you have a CRUD operation, you pass the same data in "C" and
> "U" - but in our case those two operations are different and serve
> different purposes
> * Deleting a dataset is also not what "D" in CRUD is - in this case it is
> mostly "retention". And there are some very specific things here. Should
> we delete a dataset that some of the DAGs still have as input/output? IMHO
> - absolutely not. But... how do we know that? If only DAGs implicitly
> create Datasets by declaring where they are used, we can easily know that
> by reference counting. But when we allow the creation of datasets via the
> API, it's no longer that obvious and the number of cases to handle gets
> really big.
>
> After seeing the comments and discussion - I believe it's not a good idea
> to allow external Dataset creation; the use case does not justify it IMHO.
>
> Why?
>
> We do not want Airflow to become a "dataset metadata storage" that you can
> query/update to find out what datasets the whole <data lake> of yours
> has - this is not the purpose of Airflow, and never will be IMHO.
> It's a non-goal for Airflow to keep "other" datasets.
>
> IMHO Airflow should only manage its own entities, and at most it should
> emit events (dataset listeners, OpenLineage API) to inform others about
> state changes of things that Airflow manages, but it should not be abused
> to store "other" datasets that Airflow DAGs know nothing about. This, in
> a way, contradicts the "Airflow as a Platform" approach of ours and the
> whole concept of the OpenLineage integration of Airflow. If you want a
> single place where all the datasets you manage are stored, have all your
> components emit OpenLineage events and use a dedicated solution (Marquez,
> Amundsen, Google Data Catalog, etc.) - all of the serious ones now
> consume OpenLineage events, which pretty much all serious components
> already emit - and there you can have it all. This is our strategic
> direction - and this is why we accepted AIP-53 OpenLineage:
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
>
> The moment we accepted it, we also accepted the fact that Airflow is
> just a producer of lineage data, not a storage nor a consumer of it -
> because this is the scope of AIP-53.
>
> I think the only way a dataset should be created in the Airflow DB is via
> the DagFileProcessor - eventually with reference counting, and possibly
> removal of datasets that are no longer used by anyone, if we decide we do
> not want to keep old datasets in the DB. That should be it IMHO.


-- 

Constance Martineau
Senior Product Manager

Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)


<https://www.astronomer.io/>
