You're right. I didn't mean to say that the Connections and Datasets facilitate the same thing - they don't. I meant that Connections are also "useless" if no task is using that Connection - but we allow them to be created independently of dags. From that angle - I don't see how allowing Datasets to be created independently is any different.
Also happy to hear from others about this.

On Wed, Jan 24, 2024 at 1:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I'd love to hear from others - especially those who are involved in
> dataset creation and discussion more than me. I personally believe
> that conceptually connections and datasets are as far from each other
> as possible (I have no idea where the perceived similarity between
> connections - which are essentially static configuration of
> credentials - and datasets - which are a dynamic reflection of data
> being passed live between tasks - comes from). The only similarity I
> see is that they are both stored by Airflow in some table (and even
> not that if you use SecretsManager). So comparing those two is an
> apple to pear comparison if you ask me.
>
> But (despite my 4 years of experience creating Airflow) my actual
> experience with Datasets is limited. I've been mainly observing what
> was going on, so I would love to hear from those who created (and
> continue to think about the future of) the datasets :).
>
> J,
>
> On Wed, Jan 24, 2024 at 7:27 PM Constance Martineau
> <consta...@astronomer.io.invalid> wrote:
>
> > Right. That is why I was trying to make a distinction in the PR and
> > in this discussion between CRUD-ing Dataset Objects/Definitions vs
> > creating and deleting Dataset Events from the queue. Happy to
> > standardize on whatever terminology makes sure things are
> > understood and we can have a productive conversation.
> >
> > For Dataset Events - creating, reading and deleting them via API is
> > IMHO not controversial.
> > - For creating: This has been discussed in various places, and the
> > endpoint could be used to trigger dependent dags
> > - For deleting: It is easy for DAGs with multiple upstream
> > dependencies to go out of sync, and there is no way to recover from
> > that without manipulating the DB directly.
> > See here <https://github.com/apache/airflow/discussions/36618> and
> > here
> > <https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>
> >
> > For CRUD-ing Dataset Definitions via API:
> >
> > > IMHO Airflow should only manage its own entities and at most it
> > > should emit events (dataset listeners, openlineage API) to inform
> > > others about state changes of things that Airflow manages, but it
> > > should not be abused to store "other" datasets that Airflow DAGs
> > > know nothing about.
> >
> > I disagree that it is an abuse. If I as an internal data producer
> > publish a dataset that I expect internal Airflow users to use, it
> > is not abusing Airflow to create a dataset and make it visible in
> > Airflow. At some point in the near future, users will start
> > referencing them in their dags - it's just a sequencing question.
> > We don't enforce connections being tied to a dag - and conceptually
> > - this is no different. It is also no different than adding the
> > definition as part of a dag file and having that dataset show up in
> > the dataset list, without forcing it to be a task output as part of
> > a dag. The only valid reason to not allow it IMHO is that they were
> > designed to be defined within a dag file, similar to a dag, and we
> > don't want to deal with the impediment I laid out.
> >
> > On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk <ja...@potiuk.com>
> > wrote:
> >
> > > On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
> > > <consta...@astronomer.io.invalid> wrote:
> > >
> > > > I also think it makes sense to allow people to
> > > > create/update/delete Datasets via the API and eventually UI.
> > > > Even if the dataset is not initially connected to a DAG, it's
> > > > nice to be able to see in one place all the datasets and ML
> > > > models that my dags can leverage.
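[Editor's note] The "creating Dataset Events via the API" idea discussed earlier in the thread could look roughly like the sketch below. This is plain Python against a hypothetical endpoint: the path `/api/v1/datasets/events`, the payload field names, and the base URL are all assumptions for illustration, not a documented Airflow API at the time of this thread.

```python
import json
from urllib import request

AIRFLOW_BASE_URL = "http://localhost:8080"  # assumed webserver location


def build_dataset_event_payload(dataset_uri, extra=None):
    """Body for a hypothetical POST /api/v1/datasets/events call."""
    return {"dataset_uri": dataset_uri, "extra": extra or {}}


def post_dataset_event(payload):
    """Prepare (but do not send) the HTTP request that would emit the event."""
    return request.Request(
        url=f"{AIRFLOW_BASE_URL}/api/v1/datasets/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_dataset_event_payload(
    "s3://bucket/warehouse/orders.parquet",  # hypothetical dataset URI
    extra={"source": "external-producer"},
)
req = post_dataset_event(payload)
print(req.method, req.full_url)
# → POST http://localhost:8080/api/v1/datasets/events
```

Sending such a request would mark the dataset as "changed" so the scheduler can trigger any dags that depend on it - which is exactly the "Update is not really the U in CRUD" distinction made later in the thread.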
> > > > We allow people to create Connections and Variables via the API
> > > > and UI without forcing users to use them as part of a task or
> > > > dag. This isn't any different from that aspect.
> > > >
> > > > > Airflow has some objects that can be created by a dag
> > > > > processor (Dags, Datasets) and others that can be created
> > > > > with API/UI (Connections, Variables)
> > >
> > > A comment from my side. I think there is a big conceptual
> > > difference here that you yourself noticed - DAG code - via
> > > DAGProcessor - creates DAGs and Datasets, and the UI/API can
> > > allow creating and modifying Connections/Variables that are then
> > > USED (but never created) by DAG code. This is why, while I see no
> > > fundamental security blocker with "Creating" Datasets via API, it
> > > definitely feels out-of-place to be able to manage them via API.
> > >
> > > And following the discussion from the PR - Yes, we should discuss
> > > create, update and delete differently. Because conceptually they
> > > are NOT typical CRUD (which the Connection / Variables API and UI
> > > is). I think there is a huge difference between "Updating" and
> > > "Deleting" datasets via the API and the `UD` in CRUD:
> > >
> > > * Updating a dataset does not actually "update" its definition;
> > > it informs those who listen on the dataset that it has changed.
> > > No more, no less. Typically when you have a CRUD operation, you
> > > pass the same data in "C" and "U" - but in our case those two
> > > operations are different and serve different purposes
> > > * Deleting the dataset is also not what "D" in CRUD is - in this
> > > case it is mostly "retention". And there are some very specific
> > > things here. Should we delete a dataset that some of the DAGs
> > > still have as input/output? IMHO - absolutely not. But .... How
> > > do we know that?
> > > If Datasets are only created implicitly by DAGs declaring them as
> > > inputs or outputs, we can easily know that by reference counting.
> > > But when we allow the creation of datasets via API - it's no
> > > longer that obvious and the number of cases to handle gets really
> > > big.
> > >
> > > After seeing the comments and discussion - I believe it's not a
> > > good idea to allow external Dataset creation; the use case does
> > > not justify it IMHO.
> > >
> > > Why?
> > >
> > > We do not want Airflow to become a "dataset metadata storage"
> > > that you can query/update to find out what datasets the whole
> > > <data lake> of yours has - this is not the purpose of Airflow,
> > > and never will be IMHO. It's a non-goal for Airflow to keep
> > > "other" datasets.
> > >
> > > IMHO Airflow should only manage its own entities and at most it
> > > should emit events (dataset listeners, openlineage API) to inform
> > > others about state changes of things that Airflow manages, but it
> > > should not be abused to store "other" datasets that Airflow DAGs
> > > know nothing about. This - in a way - contradicts the "Airflow as
> > > a Platform" approach of ours and the whole concept of the
> > > OpenLineage integration of Airflow. If you want a single place
> > > where you store all the datasets you manage, have all your
> > > components emit OpenLineage events and use a dedicated solution
> > > (Marquez, Amundsen, Google Data Catalog, etc.) - all of the
> > > serious ones now consume OpenLineage events that pretty much all
> > > serious components already emit - and there you can have it all.
> > > This is our strategic direction - and this is why we accepted
> > > AIP-53 OpenLineage:
> > >
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > >
> > > At the moment we accepted it, we also accepted the fact that
> > > Airflow is just a producer of lineage data, not a storage nor a
> > > consumer of it - because this is the scope of AIP-53.
> > >
> > > I think the only way a dataset should be created in the Airflow
> > > DB is via DagFileProcessor - eventually with reference counting,
> > > and possibly removal of datasets that are no longer used by
> > > anyone, if we decide we do not want to keep old datasets in the
> > > DB. That should be it IMHO.

--
Constance Martineau
Senior Product Manager

Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)

<https://www.astronomer.io/>
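[Editor's note] The reference-counting idea raised above can be sketched as a toy model in plain Python. The dag-to-dataset mapping and the URIs here are invented for illustration; in real Airflow this information would come from what the DagFileProcessor parses out of dag files, not from a hand-written dict.

```python
from collections import Counter

# Toy model: which dataset URIs each dag declares as inputs or outputs.
dag_dataset_refs = {
    "ingest_orders": ["s3://lake/raw/orders"],
    "build_report": ["s3://lake/raw/orders", "s3://lake/marts/report"],
}

# All dataset rows currently sitting in the metadata DB.
known_datasets = {
    "s3://lake/raw/orders",
    "s3://lake/marts/report",
    "s3://lake/old/unused",
}


def dataset_ref_counts(refs):
    """Count how many dags reference each dataset URI."""
    return Counter(uri for uris in refs.values() for uri in uris)


def orphaned_datasets(datasets, refs):
    """Datasets referenced by zero dags - candidates for retention cleanup."""
    counts = dataset_ref_counts(refs)
    return {uri for uri in datasets if counts[uri] == 0}


print(orphaned_datasets(known_datasets, dag_dataset_refs))
# → {'s3://lake/old/unused'}
```

This also illustrates the objection in the thread: once datasets can be created via the API, a zero reference count no longer implies "safe to delete" - the dataset may simply not be wired into any dag yet.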