On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
<consta...@astronomer.io.invalid> wrote:

> I also think it makes sense to allow people to create/update/delete
> Datasets via the API and eventually UI. Even if the dataset is not
> initially connected to a DAG, it's nice to be able to see in one place all
> the datasets and ML models that my dags can leverage. We allow people to
> create Connections and Variables via the API and UI without forcing users
> to use them as part of a task or dag. This isn't any different from that
> aspect.
>
> Airflow has some objects that can
> > be created by a dag processor (Dags, Datasets) and others that can be
> > created with API/UI (Connections, Variables)
>
>
A comment from my side. I think there is a big conceptual difference here,
one that you yourself noticed: DAG code, via the DagFileProcessor, creates
DAGs and Datasets, while the UI/API allows creating and modifying
Connections/Variables that are then USED (but never created) by DAG code.
This is why, while I see no fundamental security blocker with "creating"
Datasets via the API, it definitely feels out of place to manage them that
way.

And following the discussion from the PR: yes, we should discuss create,
update and delete separately, because conceptually they are NOT the typical
CRUD operations that the Connection/Variable API and UI provide.
I think there is a huge difference between "updating" and "deleting"
datasets via the API and the `UD` in CRUD:

* Updating a dataset does not actually "update" its definition; it informs
those who listen on the dataset that it has changed. No more, no less.
In a typical CRUD operation you pass the same data to "C" and "U", but in
our case those two operations take different inputs and serve different
purposes.
* Deleting a dataset is also not what the "D" in CRUD is; in our case it is
mostly retention. And there are some very specific questions here. Should
we delete a dataset that some DAGs still have as input/output? IMHO,
absolutely not. But... how do we know that? If datasets are created only
implicitly, by DAGs declaring that they use them, we can easily know via
reference counting. But once we allow creation of datasets via the API,
it is no longer obvious, and the number of cases to handle grows quickly.
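To make the first point concrete, here is a minimal, hypothetical in-memory sketch (none of these classes are Airflow's actual models) of why a dataset "update" is not a CRUD update: it appends an event and notifies the listening DAGs, while the dataset definition itself is never mutated.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Dataset:
    uri: str  # the dataset's identity/definition -- never mutated below
    consumers: list = field(default_factory=list)  # DAG ids listening on it

@dataclass
class DatasetEvent:
    dataset_uri: str
    timestamp: datetime

def post_dataset_event(dataset: Dataset, event_log: list) -> list:
    """'Updating' a dataset: record that it changed and return the DAG ids
    that should be triggered. The Dataset row itself is untouched."""
    event_log.append(DatasetEvent(dataset.uri, datetime.now(timezone.utc)))
    return list(dataset.consumers)

ds = Dataset(uri="s3://bucket/table", consumers=["downstream_dag"])
events: list = []
triggered = post_dataset_event(ds, events)
# triggered lists the consumers to schedule; ds.uri is unchanged
```

Contrast this with a CRUD "U", where the caller would send a new definition and the stored object would be overwritten.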

After seeing the comments and discussion, I believe it's not a good idea
to allow external Dataset creation; the use case does not justify it IMHO.

Why ?

We do not want Airflow to become a "dataset metadata store" that you can
query/update to find out which datasets your whole <data lake> contains.
That is not the purpose of Airflow and never will be, IMHO. Keeping
"other" datasets is a non-goal for Airflow.

IMHO Airflow should only manage its own entities, and at most it should
emit events (dataset listeners, the OpenLineage API) to inform others about
state changes of the things Airflow manages; it should not be abused to
store "other" datasets that Airflow DAGs know nothing about. That would, in
a way, contradict our "Airflow as a Platform" approach and the whole
concept of Airflow's OpenLineage integration. If you want a single place
that stores all the datasets you manage, have all your components emit
OpenLineage events and use a dedicated solution (Marquez, Amundsen, Google
Data Catalog, etc.); all of the serious catalogs now consume OpenLineage
events, which pretty much all serious components already emit, and there
you can have it all. This is our strategic direction, and this is why we
accepted AIP-53 OpenLineage:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow.
The moment we accepted it, we also accepted that Airflow is just a producer
of lineage data, not a storage or consumer of it, because that is the scope
of AIP-53.
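For anyone unfamiliar with what "producer of lineage data" means in practice, here is a stdlib-only sketch of the shape of an OpenLineage run event, as a producer would emit it and a catalog would consume it. Field names follow the OpenLineage event spec; the producer URI, namespaces and names below are made-up examples, and a real integration would use the OpenLineage client library instead of hand-rolled JSON.

```python
import json
import uuid
from datetime import datetime, timezone

def make_run_event(job_name: str, output_name: str) -> dict:
    """Build a minimal OpenLineage-shaped COMPLETE event for a job run
    that produced one output dataset. Illustrative only."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/my-producer",  # made-up producer URI
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example", "name": job_name},
        "inputs": [],
        "outputs": [{"namespace": "s3://bucket", "name": output_name}],
    }

event = make_run_event("daily_load", "table")
payload = json.dumps(event)  # this is what gets POSTed to the lineage backend
```

The point is that the dataset catalog lives in the consumer (Marquez and friends), not in the producer's database.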

I think the only way a dataset should be created in the Airflow DB is via
the DagFileProcessor, eventually with reference counting and removal of
datasets that no DAG uses any more, if we decide we do not want to keep old
datasets in the DB. That should be it IMHO.
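The reference counting mentioned above can be sketched very simply, assuming (as proposed) that datasets exist only because DAGs declare them. The function names and data shapes here are hypothetical, not Airflow APIs.

```python
from collections import Counter

def referenced_datasets(dag_declarations: dict) -> Counter:
    """dag_declarations maps dag_id -> set of dataset URIs the DAG
    declares as inputs or outputs. Returns URI -> reference count."""
    counts: Counter = Counter()
    for uris in dag_declarations.values():
        counts.update(uris)
    return counts

def prune_orphans(known_datasets: set, dag_declarations: dict) -> set:
    """Return the datasets still referenced by at least one DAG;
    everything else is safe to garbage-collect."""
    counts = referenced_datasets(dag_declarations)
    return {uri for uri in known_datasets if counts[uri] > 0}

dags = {"dag_a": {"s3://x"}, "dag_b": {"s3://x", "s3://y"}}
known = {"s3://x", "s3://y", "s3://stale"}
kept = prune_orphans(known, dags)  # "s3://stale" has no referencing DAG
```

This is exactly the property that API-created datasets break: with no declaring DAG, the count for such a dataset is zero by construction, so you can no longer tell "orphaned" from "externally owned".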



>
> --
>
> Constance Martineau
> Senior Product Manager
>
> Email: consta...@astronomer.io
> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>
>
> <https://www.astronomer.io/>
>
