Right. That is why I was trying to make a distinction in the PR and in this discussion between CRUD-ing Dataset Objects/Definitions vs creating and deleting Dataset Events from the queue. Happy to standardize on whatever terminology to make sure things are understood and we can have a productive conversation.
For Dataset Events - creating, reading and deleting them via API is IMHO
not controversial:

- For creating: This has been discussed in various places, and the
endpoint could be used to trigger dependent dags.
- For deleting: It is easy for DAGs with multiple upstream dependencies
to go out of sync, and there is no way to recover from that without
manipulating the DB directly. See here
<https://github.com/apache/airflow/discussions/36618> and here
<https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>

For CRUD-ing Dataset Definitions via API:

> IMHO Airflow should only manage its own entities and at most it should
> emit events (dataset listeners, openlineage API) to inform others about
> state changes of things that Airflow manages, but it should not be abused
> to store "other" datasets, that Airflow DAGs know nothing about.

I disagree that it is an abuse. If I as an internal data producer publish
a dataset that I expect internal Airflow users to use, it is not abusing
Airflow to create a dataset and make it visible in Airflow. At some point
in the near future, users will start referencing them in their dags - it's
just a sequencing question. We don't enforce connections being tied to a
dag - and conceptually this is no different. It is also no different than
adding the definition as part of a dag file and having that dataset show
up in the dataset list, without forcing it to be a task output as part of
a dag. The only valid reason to not allow it IMHO is because they were
designed to be defined within a dag file, similar to a dag, and we don't
want to deal with the impediment I laid out.

On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
> <consta...@astronomer.io.invalid> wrote:
>
> > I also think it makes sense to allow people to create/update/delete
> > Datasets via the API and eventually UI.
> > Even if the dataset is not initially connected to a DAG, it's nice to
> > be able to see in one place all the datasets and ML models that my
> > dags can leverage. We allow people to create Connections and Variables
> > via the API and UI without forcing users to use them as part of a task
> > or dag. This isn't any different from that aspect.
> >
> > > Airflow has some objects that can be created by a dag processor
> > > (Dags, Datasets) and others that can be created with API/UI
> > > (Connections, Variables)
>
> A comment from my side. I think there is a big conceptual difference here
> that you yourself noticed - DAG code - via DAGProcessor - creates DAGs
> and Datasets, and UI/API can allow to create and modify
> Connections/Variables that are then USED (but never created) by DAG code.
> This is why, while I see no fundamental security blocker with "Creating"
> Datasets via API, it definitely feels out-of-place to be able to manage
> them via API.
>
> And following the discussion from the PR - yes, we should discuss create,
> update and delete differently. Because conceptually they are NOT typical
> CRUD (which the Connection / Variables API/UI is).
> I think there is a huge difference between "Updating" and "Deleting"
> datasets via the API and the `UD` in CRUD:
>
> * Updating a dataset does not actually "update" its definition; it
> informs those who listen on the dataset that it has changed. No more, no
> less. Typically when you have a CRUD operation, you pass the same data in
> "C" and "U" - but in our case those two operations are different and
> serve different purposes.
> * Deleting the dataset is also not what "D" in CRUD is - in this case it
> is mostly a "retention". And there are some very specific things here.
> Should we delete a dataset that some of the DAGs still have as
> input/output? IMHO - absolutely not. But... how do we know that?
> If we have only DAGs implicitly creating Datasets by declaring whether
> they are used or not, we can easily know that by reference counting. But
> when we allow the creation of datasets via API - it's no longer that
> obvious and the number of cases to handle gets really big.
>
> After seeing the comments and discussion - I believe it's not a good idea
> to allow external Dataset creation; the use case does not justify it
> IMHO.
>
> Why? We do not want Airflow to become a "dataset metadata storage" that
> you can query/update to find out what all kinds of datasets the whole
> <data lake> of yours has - this is not the purpose of Airflow, and will
> never be IMHO. It's a non-goal for Airflow to keep "other" datasets.
>
> IMHO Airflow should only manage its own entities and at most it should
> emit events (dataset listeners, openlineage API) to inform others about
> state changes of things that Airflow manages, but it should not be abused
> to store "other" datasets, that Airflow DAGs know nothing about. This in
> a way contradicts the "Airflow as a Platform" approach of ours and the
> whole concept of OpenLineage integration in Airflow. If you want to have
> a single place where you store all the datasets you manage, have all your
> components emit OpenLineage events and use a dedicated solution (Marquez,
> Amundsen, Google Data Catalog, etc.) - all of the serious ones now
> consume OpenLineage events that pretty much all serious components
> already emit - and there you can have it all. This is our strategic
> direction - and this is why we accepted AIP-53 OpenLineage:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> At the moment we accepted it, we also accepted the fact that Airflow is
> just a producer of lineage data, not a storage nor consumer of it -
> because this is the scope of AIP-53.
>
> I think the only way a dataset should be created in the Airflow DB is
> via the DagFileProcessor.
> With reference counting eventually, and possibly removal of datasets
> that are not used by anyone any more - if we decide we do not want to
> keep old datasets in the DB. That should be it IMHO.

-- 
Constance Martineau
Senior Product Manager

Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)

<https://www.astronomer.io/>
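The reference-counting idea in the quoted email above can be sketched in a few lines. This is a hypothetical illustration, not Airflow's actual implementation; the class and method names are invented for this sketch. The point it shows: when only the DagFileProcessor creates dataset references, deletability is a simple "is the reference set empty?" check.

```python
# Hypothetical sketch (not Airflow's implementation) of reference counting:
# a dataset definition is only safe to delete/retire once no DAG references
# it as an input or output.
from collections import defaultdict


class DatasetRegistry:
    def __init__(self) -> None:
        # dataset URI -> set of DAG ids that reference it
        self._refs: dict[str, set[str]] = defaultdict(set)

    def register(self, dataset_uri: str, dag_id: str) -> None:
        """Record that a DAG (e.g. via the DagFileProcessor) references a dataset."""
        self._refs[dataset_uri].add(dag_id)

    def unregister_dag(self, dag_id: str) -> None:
        """Drop all references held by a DAG that was removed or re-parsed."""
        for dag_ids in self._refs.values():
            dag_ids.discard(dag_id)

    def deletable(self) -> list[str]:
        """Datasets no longer referenced by any DAG (retention candidates)."""
        return [uri for uri, dag_ids in self._refs.items() if not dag_ids]


registry = DatasetRegistry()
registry.register("s3://bucket/orders.parquet", "orders_etl")
registry.register("s3://bucket/orders.parquet", "orders_report")
registry.register("s3://bucket/clicks.parquet", "clicks_etl")
registry.unregister_dag("orders_etl")
registry.unregister_dag("orders_report")
print(registry.deletable())  # -> ['s3://bucket/orders.parquet']
```

Datasets created externally via an API have no such reference to count, which is exactly the ambiguity the email describes: there is no principled answer to "is anyone still using this?".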