I'd love to hear from others - especially those who are more involved in
dataset creation and discussion than I am. I personally believe that,
conceptually, connections and datasets are about as far from each other as
possible. I have no idea where the perceived similarity comes from:
connections are essentially static configuration of credentials, while
datasets are a dynamic reflection of data being passed live between tasks.
The only similarity I see is that they are both stored by Airflow in some
table (and not even that if you use SecretsManager). So comparing those
two is an apples-to-pears comparison if you ask me.

But (despite my 4 years of experience working on Airflow) my actual
experience with Datasets is limited - I've mainly been observing what was
going on - so I would love to hear from those who created (and continue to
think about the future of) the datasets :).

J,

On Wed, Jan 24, 2024 at 7:27 PM Constance Martineau
<consta...@astronomer.io.invalid> wrote:

> Right. That is why I was trying to make a distinction in the PR and in this
> discussion between CRUD-ing Dataset Objects/Definitions vs creating and
> deleting Dataset Events from the queue. Happy to standardize on whatever
> terminology to make sure things are understood and we can have a productive
> conversation.
>
> For Dataset Events - creating, reading and deleting them via API is IMHO
> not controversial (a sketch of what such calls could look like follows
> this list):
> - For creating: This has been discussed in various places, and the
> endpoint could be used to trigger dependent dags
> - For deleting: It is easy for DAGs with multiple upstream dependencies
> to go out of sync, and there is no way to recover from that without
> manipulating the DB directly. See here
> <https://github.com/apache/airflow/discussions/36618> and here
> <https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>
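>
> To make the event case concrete, here is a minimal sketch of what such
> calls could look like. The endpoint paths, payload shape and credentials
> here are assumptions for illustration, not a finalized API:
>
>     import requests
>
>     AIRFLOW = "http://localhost:8080/api/v1"
>     AUTH = ("admin", "admin")  # example credentials
>
>     # Hypothetical: create a dataset event, triggering dependent dags
>     requests.post(
>         f"{AIRFLOW}/datasets/events",
>         auth=AUTH,
>         json={"dataset_uri": "s3://bucket/my-table", "extra": {"rows": 1000}},
>     )
>
>     # Hypothetical: delete queued events so a dag with multiple upstream
>     # dependencies can be brought back in sync without touching the DB
>     requests.delete(
>         f"{AIRFLOW}/datasets/queuedEvent/s3%3A%2F%2Fbucket%2Fmy-table",
>         auth=AUTH,
>     )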
>
> For CRUD-ing Dataset Definitions via API:
>
> > IMHO Airflow should only manage its own entities and at most it should
> > emit events (dataset listeners, openlineage API) to inform others about
> > state changes of things that Airflow manages, but it should not be abused
> > to store "other" datasets that Airflow DAGs know nothing about.
>
>
> I disagree that it is an abuse. If I, as an internal data producer,
> publish a dataset that I expect internal Airflow users to use, it is not
> abusing Airflow to create that dataset and make it visible in Airflow. At
> some point in the near future, users will start referencing it in their
> dags - it's just a sequencing question. We don't force connections to be
> tied to a dag - and conceptually this is no different. It is also no
> different from adding the definition as part of a dag file and having
> that dataset show up in the dataset list, without forcing it to be a task
> output as part of a dag (see the sketch below). The only valid reason to
> not allow it IMHO is that datasets were designed to be defined within a
> dag file, similar to a dag, and we don't want to deal with the impediment
> I laid out.
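>
> For reference, this is roughly what the in-file definition looks like
> today (Airflow 2.4+ dataset API; the URI and dag id are made up for
> illustration):
>
>     import pendulum
>     from airflow.datasets import Dataset
>     from airflow.decorators import dag, task
>
>     # Declared in a dag file, but produced by something outside Airflow
>     upstream_table = Dataset("s3://warehouse/reporting/orders")
>
>     @dag(schedule=[upstream_table], start_date=pendulum.datetime(2024, 1, 1))
>     def consume_orders():
>         @task
>         def process():
>             ...  # consume the externally produced data
>         process()
>
>     consume_orders()
>
> The dataset shows up in the dataset list even though no task in any dag
> declares it as an outlet.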
>
> On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
> > <consta...@astronomer.io.invalid> wrote:
> >
> > > I also think it makes sense to allow people to create/update/delete
> > > Datasets via the API and eventually UI. Even if the dataset is not
> > > initially connected to a DAG, it's nice to be able to see in one place
> > > all the datasets and ML models that my dags can leverage. We allow
> > > people to create Connections and Variables via the API and UI without
> > > forcing users to use them as part of a task or dag. This isn't any
> > > different in that respect.
> > >
> > > > Airflow has some objects that can be created by a dag processor
> > > > (Dags, Datasets) and others that can be created with API/UI
> > > > (Connections, Variables)
> > >
> > >
> > A comment from my side. I think there is a big conceptual difference
> > here that you yourself noticed - DAG code, via the DagProcessor, creates
> > DAGs and Datasets, while the UI/API can allow creating and modifying
> > Connections/Variables that are then USED (but never created) by DAG
> > code. This is why, while I see no fundamental security blocker with
> > "Creating" Datasets via API, it definitely feels out-of-place to be
> > able to manage them via API.
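> >
> > To illustrate the asymmetry: a connection is created out-of-band and
> > only referenced by dag code, while a dataset comes into existence by
> > being declared in dag code. A minimal sketch (ids, URIs and credentials
> > are made up):
> >
> >     import requests
> >
> >     # Connections: created via the stable REST API, then USED by dags
> >     requests.post(
> >         "http://localhost:8080/api/v1/connections",
> >         auth=("admin", "admin"),
> >         json={"connection_id": "my_s3", "conn_type": "aws"},
> >     )
> >
> >     # Datasets: there is no equivalent call today - the DagProcessor
> >     # creates the row when it parses a dag file that declares:
> >     #     Dataset("s3://warehouse/reporting/orders")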
> >
> > And following the discussion from the PR - yes, we should discuss
> > create, update and delete separately. Because conceptually they are NOT
> > typical CRUD (which the Connection/Variables API and UI are).
> > I think there is a huge difference between "Updating" and "Deleting"
> > datasets via the API and the `UD` in CRUD:
> >
> > * Updating a dataset does not actually "update" its definition; it
> > informs those who listen on the dataset that it has changed. No more,
> > no less. Typically in a CRUD operation you pass the same data to "C"
> > and "U" - but in our case those two operations are different and serve
> > different purposes
> > * Deleting a dataset is also not what "D" in CRUD is - in this case it
> > is mostly "retention". And there are some very specific things here.
> > Should we delete a dataset that some of the DAGs still have as
> > input/output? IMHO - absolutely not. But... how do we know that? If
> > only DAGs implicitly create Datasets - by declaring where they are
> > used - we can easily know that by reference counting (see the sketch
> > after this list). But when we allow creation of datasets via API, it's
> > no longer that obvious and the number of cases to handle gets really
> > big.
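> >
> > A sketch of the reference-counting idea mentioned above - entirely
> > hypothetical, no such helper exists in Airflow today:
> >
> >     def dataset_is_deletable(uri: str, dag_bag) -> bool:
> >         """Hypothetical: a dataset is safe to delete only when no dag
> >         references it as a schedule input or a task outlet."""
> >         for dag in dag_bag.dags.values():
> >             inputs = {d.uri for d in (dag.dataset_triggers or [])}
> >             outputs = {d.uri for t in dag.tasks for d in (t.outlets or [])}
> >             if uri in inputs or uri in outputs:
> >                 return False  # still referenced by parsed dag code
> >         return True
> >
> > With API-created datasets there is no dag file to count references
> > from, which is exactly why "how do we know that?" stops having an easy
> > answer.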
> >
> > After seeing the comments and discussion, I believe it's not a good
> > idea to allow external Dataset creation; the use case does not justify
> > it IMHO.
> >
> > Why?
> >
> > We do not want Airflow to become a "dataset metadata storage" that you
> > can query/update to find out what kinds of datasets your whole <data
> > lake> has - this is not the purpose of Airflow, and never will be IMHO.
> > It's a non-goal for Airflow to keep "other" datasets.
> >
> > IMHO Airflow should only manage its own entities, and at most it should
> > emit events (dataset listeners, openlineage API) to inform others about
> > state changes of things that Airflow manages - but it should not be
> > abused to store "other" datasets that Airflow DAGs know nothing about.
> > That would, in a way, contradict our "Airflow as a Platform" approach
> > and the whole concept of Airflow's OpenLineage integration. If you want
> > a single place where all the datasets you manage are stored, have all
> > your components emit OpenLineage events and use a dedicated solution
> > (Marquez, Amundsen, Google Data Catalog, etc.) - all of the serious
> > catalogs now consume OpenLineage events, which pretty much all serious
> > components already emit - and there you can have it all. This is our
> > strategic direction, and this is why we accepted AIP-53 OpenLineage:
> >
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> >
> > At the moment we accepted it, we also accepted the fact that Airflow is
> > just a producer of lineage data, not a storage for it nor a consumer of
> > it - because this is the scope of AIP-53.
> >
> > I think the only way a dataset should be created in the Airflow DB is
> > via the DagFileProcessor - possibly with reference counting and removal
> > of datasets that are no longer used by anyone, if we decide we do not
> > want to keep old datasets in the DB. That should be it IMHO.
> >
> >
> >
> > >
> >
>
>
> --
>
> Constance Martineau
> Senior Product Manager
>
> Email: consta...@astronomer.io
> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>
>
> <https://www.astronomer.io/>
>
