You're right. I didn't mean to say that the Connections and Datasets facilitate the same thing - they don't. I meant that Connections are also "useless" if no task is using that Connection - but we allow them to be created independently of dags. From that angle - I don't see how allowing Datasets to be created independently is any different.
Also happy to hear from others about this.

On Wed, Jan 24, 2024 at 1:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I'd love to hear from others - especially those who are involved in
> dataset creation and discussion more than me. I personally believe
> that conceptually connections and datasets are as far from each other
> as possible (I have no idea where the perceived similarity between
> connections - which are essentially static configuration of
> credentials - and datasets - which are a dynamic reflection of data
> being passed live between tasks - comes from). The only similarity I
> see is that they are both stored by Airflow in some table (and even
> not that if you use SecretsManager). So comparing those two is an
> apple to pear comparison if you ask me.
>
> But (despite my 4 years of experience creating Airflow) my actual
> experience with Datasets is limited. I've been mainly observing what
> was going on, so I would love to hear from those who created (and
> continue to think about the future of) the datasets :).
>
> J,
>
> On Wed, Jan 24, 2024 at 7:27 PM Constance Martineau
> <consta...@astronomer.io.invalid> wrote:
>
> > Right. That is why I was trying to make a distinction in the PR and
> > in this discussion between CRUD-ing Dataset Objects/Definitions vs
> > creating and deleting Dataset Events from the queue. Happy to
> > standardize on whatever terminology makes sure things are
> > understood and we can have a productive conversation.
> >
> > For Dataset Events - creating, reading and deleting them via API is
> > IMHO not controversial.
> > - For creating: This has been discussed in various places, and the
> > endpoint could be used to trigger dependent dags
> > - For deleting: It is easy for DAGs with multiple upstream
> > dependencies to go out of sync, and there is no way to recover from
> > that without manipulating the DB directly.
> > See here <https://github.com/apache/airflow/discussions/36618> and
> > here
> > <https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>
> >
> > For CRUD-ing Dataset Definitions via API:
> >
> > > IMHO Airflow should only manage its own entities and at most it
> > > should emit events (dataset listeners, openlineage API) to inform
> > > others about state changes of things that Airflow manages, but it
> > > should not be abused to store "other" datasets that Airflow DAGs
> > > know nothing about.
> >
> > I disagree that it is an abuse. If I as an internal data producer
> > publish a dataset that I expect internal Airflow users to use, it
> > is not abusing Airflow to create a dataset and make it visible in
> > Airflow. At some point in the near future, users will start
> > referencing them in their dags - it's just a sequencing question.
> > We don't enforce connections being tied to a dag - and conceptually
> > - this is no different. It is also no different than adding the
> > definition as part of a dag file and having that dataset show up in
> > the dataset list, without forcing it to be a task output as part of
> > a dag. The only valid reason to not allow it IMHO is that they were
> > designed to be defined within a dag file, similar to a dag, and we
> > don't want to deal with the impediment I laid out.
> >
> > On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk <ja...@potiuk.com>
> > wrote:
> >
> > > On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
> > > <consta...@astronomer.io.invalid> wrote:
> > >
> > > > I also think it makes sense to allow people to
> > > > create/update/delete Datasets via the API and eventually UI.
> > > > Even if the dataset is not initially connected to a DAG, it's
> > > > nice to be able to see in one place all the datasets and ML
> > > > models that my dags can leverage.
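[Editor's note] The "creating Dataset Events via the API" idea discussed earlier in the thread could look roughly like the sketch below. This is plain Python against a hypothetical endpoint: the path `/api/v1/datasets/events`, the payload field names, and the base URL are all assumptions for illustration, not a documented Airflow API at the time of this thread.

```python
import json
from urllib import request

AIRFLOW_BASE_URL = "http://localhost:8080"  # assumed webserver location


def build_dataset_event_payload(dataset_uri, extra=None):
    """Body for a hypothetical POST /api/v1/datasets/events call."""
    return {"dataset_uri": dataset_uri, "extra": extra or {}}


def post_dataset_event(payload):
    """Prepare (but do not send) the HTTP request that would emit the event."""
    return request.Request(
        url=f"{AIRFLOW_BASE_URL}/api/v1/datasets/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_dataset_event_payload(
    "s3://bucket/warehouse/orders.parquet",  # hypothetical dataset URI
    extra={"source": "external-producer"},
)
req = post_dataset_event(payload)
print(req.method, req.full_url)
# → POST http://localhost:8080/api/v1/datasets/events
```

Sending such a request would mark the dataset as "changed" so the scheduler can trigger any dags that depend on it - which is exactly the "Update is not really the U in CRUD" distinction made later in the thread.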
> > > > We allow people to create Connections and Variables via the API
> > > > and UI without forcing users to use them as part of a task or
> > > > dag. This isn't any different from that aspect.
> > > >
> > > > > Airflow has some objects that can be created by a dag
> > > > > processor (Dags, Datasets) and others that can be created
> > > > > with API/UI (Connections, Variables)
> > >
> > > A comment from my side. I think there is a big conceptual
> > > difference here that you yourself noticed - DAG code - via
> > > DAGProcessor - creates DAGs and Datasets, and the UI/API can
> > > allow creating and modifying Connections/Variables that are then
> > > USED (but never created) by DAG code. This is why, while I see no
> > > fundamental security blocker with "Creating" Datasets via API, it
> > > definitely feels out-of-place to be able to manage them via API.
> > >
> > > And following the discussion from the PR - Yes, we should discuss
> > > create, update and delete differently. Because conceptually they
> > > are NOT typical CRUD (which the Connection / Variables API and UI
> > > is). I think there is a huge difference between "Updating" and
> > > "Deleting" datasets via the API and the `UD` in CRUD:
> > >
> > > * Updating a dataset does not actually "update" its definition;
> > > it informs those who listen on the dataset that it has changed.
> > > No more, no less. Typically when you have a CRUD operation, you
> > > pass the same data in "C" and "U" - but in our case those two
> > > operations are different and serve different purposes
> > > * Deleting the dataset is also not what "D" in CRUD is - in this
> > > case it is mostly "retention". And there are some very specific
> > > things here. Should we delete a dataset that some of the DAGs
> > > still have as input/output? IMHO - absolutely not. But .... How
> > > do we know that?
> > > If Datasets are only created implicitly by DAGs declaring them as
> > > inputs or outputs, we can easily know that by reference counting.
> > > But when we allow the creation of datasets via API - it's no
> > > longer that obvious and the number of cases to handle gets really
> > > big.
> > >
> > > After seeing the comments and discussion - I believe it's not a
> > > good idea to allow external Dataset creation; the use case does
> > > not justify it IMHO.
> > >
> > > Why?
> > >
> > > We do not want Airflow to become a "dataset metadata storage"
> > > that you can query/update to find out what datasets the whole
> > > <data lake> of yours has - this is not the purpose of Airflow,
> > > and never will be IMHO. It's a non-goal for Airflow to keep
> > > "other" datasets.
> > >
> > > IMHO Airflow should only manage its own entities and at most it
> > > should emit events (dataset listeners, openlineage API) to inform
> > > others about state changes of things that Airflow manages, but it
> > > should not be abused to store "other" datasets that Airflow DAGs
> > > know nothing about. This - in a way - contradicts the "Airflow as
> > > a Platform" approach of ours and the whole concept of the
> > > OpenLineage integration of Airflow. If you want a single place
> > > where you store all the datasets you manage, have all your
> > > components emit OpenLineage events and use a dedicated solution
> > > (Marquez, Amundsen, Google Data Catalog, etc.) - all of the
> > > serious ones now consume OpenLineage events that pretty much all
> > > serious components already emit - and there you can have it all.
> > > This is our strategic direction - and this is why we accepted
> > > AIP-53 OpenLineage:
> > >
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > >
> > > At the moment we accepted it, we also accepted the fact that
> > > Airflow is just a producer of lineage data, not a storage nor a
> > > consumer of it - because this is the scope of AIP-53.
> > >
> > > I think the only way a dataset should be created in the Airflow
> > > DB is via DagFileProcessor - eventually with reference counting,
> > > and possibly removal of datasets that are no longer used by
> > > anyone, if we decide we do not want to keep old datasets in the
> > > DB. That should be it IMHO.

--
Constance Martineau
Senior Product Manager

Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)

<https://www.astronomer.io/>
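[Editor's note] The reference-counting idea raised above can be sketched as a toy model in plain Python. The dag-to-dataset mapping and the URIs here are invented for illustration; in real Airflow this information would come from what the DagFileProcessor parses out of dag files, not from a hand-written dict.

```python
from collections import Counter

# Toy model: which dataset URIs each dag declares as inputs or outputs.
dag_dataset_refs = {
    "ingest_orders": ["s3://lake/raw/orders"],
    "build_report": ["s3://lake/raw/orders", "s3://lake/marts/report"],
}

# All dataset rows currently sitting in the metadata DB.
known_datasets = {
    "s3://lake/raw/orders",
    "s3://lake/marts/report",
    "s3://lake/old/unused",
}


def dataset_ref_counts(refs):
    """Count how many dags reference each dataset URI."""
    return Counter(uri for uris in refs.values() for uri in uris)


def orphaned_datasets(datasets, refs):
    """Datasets referenced by zero dags - candidates for retention cleanup."""
    counts = dataset_ref_counts(refs)
    return {uri for uri in datasets if counts[uri] == 0}


print(orphaned_datasets(known_datasets, dag_dataset_refs))
# → {'s3://lake/old/unused'}
```

This also illustrates the objection in the thread: once datasets can be created via the API, a zero reference count no longer implies "safe to delete" - the dataset may simply not be wired into any dag yet.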