I've had a few conversations with Astronomer customers over the past few days who are looking for an approved way to create datasets outside of the dag parsing process. They are already using - or are considering - some sort of custom process similar to what Steve suggested in the GitHub discussion <https://github.com/apache/airflow/discussions/36723#discussioncomment-8243269>.
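For illustration, a minimal sketch of one such workaround (the DAG id and dataset URI here are hypothetical, assuming Airflow 2.4+ where tasks can declare Dataset outlets): a trigger-only DAG whose single task lists the dataset as an outlet, so an external producer can record a dataset event by triggering the DAG through the stable REST API (POST /api/v1/dags/emit_orders_dataset_event/dagRuns).

    import pendulum

    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    # Hypothetical dataset URI that an external producer wants to "update".
    orders = Dataset("s3://warehouse/orders")

    @dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
    def emit_orders_dataset_event():
        @task(outlets=[orders])
        def touch():
            # The body can be a no-op: when this task succeeds, Airflow
            # records a DatasetEvent for the outlet and triggers any DAGs
            # scheduled on the dataset.
            pass

        touch()

    emit_orders_dataset_event()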
Given those conversations and the feedback from PRs, GitHub Discussions, and this dev thread, I appreciate that there's a need Airflow isn't filling today. To gather more support, we need a proper answer for how we will deal with clashes between the imperative and declarative approaches. As a Product Manager, I do not have the skillset to figure this out on my own, but I would be happy to work with someone in the community on this.

On Thu, Jan 25, 2024 at 9:58 AM Eduardo Nicastro <edu.nicas...@gmail.com> wrote:

> Thanks, Potiuk, for highlighting the importance of aligning new features with Airflow's roadmap. I agree we need to be cautious about expanding dataset functionalities in ways that might conflict with existing or planned features. However, this approach doesn't necessarily transform Airflow into a "dataset metadata storage"; rather, it enhances Airflow's role as a centralized orchestrator by making datasets more visible and manageable.
>
> Tornike G., you raise a valid concern about mixing declarative and imperative approaches. We need to think carefully about how API-created datasets would coexist with those defined in DAG files. In my opinion, though, this is a natural transition that will become necessary as Airflow is used in increasingly diverse environments and organizations - a shift that seems inevitable.
>
> Constance M., your perspective on enabling API/UI management for datasets is spot-on. It adds a layer of flexibility and visibility that's crucial for modern data orchestration, aligning well with Airflow's goal of being a comprehensive workflow platform without overstepping its primary functions.
>
> To add my perspective, echoing some of what I posted in the GH discussion (https://github.com/apache/airflow/discussions/36723): data-aware scheduling was a transformative step for Airflow because it acknowledged data as the primary workflow trigger. This proposal is essentially an extension of that concept, further decoupling Airflow from the assumption that only DAGs can influence datasets. I also believe it aligns with modern data engineering practices, where workflows are increasingly driven by data events, and I think it is particularly interesting for larger organizations where datasets frequently span various systems and teams.
>
> On Wed, Jan 24, 2024 at 8:53 PM Tornike Gurgenidze <togur...@freeuni.edu.ge> wrote:
>
> > What I meant by update/delete operations referred to Dataset objects themselves, not DatasetEvents. I also see no issue in allowing dataset changes to be registered externally. I admit that deleting datasets is probably irrelevant - even now they are not deleted, but orphaned after reference counting - but the U in CRUD is still very much relevant imho. There's a field called extra in DatasetModel, for example, which has no use inside airflow, but it still might be used from user code in all sorts of ways.
> >
> > I'm not saying it's impossible for these interfaces to coexist if you isolate them from one another - multiple dag-processors already do something similar for dags (isolating sets of objects from one another using the processor_subdir value) - it just feels unnatural to have a declarative (dag code) and an imperative (API/UI) interface for interacting with one type of objects.
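To make the DatasetModel.extra example concrete, here is a sketch of the kind of out-of-band use Tornike describes, assuming direct access to the Airflow metadata DB (the URI and the "owner_team" key are hypothetical):

    from airflow.models.dataset import DatasetModel
    from airflow.utils.session import create_session

    # Attach user-defined metadata to an existing dataset row. Airflow
    # itself ignores the "extra" payload; it exists for user code.
    with create_session() as session:
        ds = (
            session.query(DatasetModel)
            .filter(DatasetModel.uri == "s3://warehouse/orders")
            .one_or_none()
        )
        if ds is not None:
            ds.extra = {**(ds.extra or {}), "owner_team": "analytics"}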
> > On Wed, Jan 24, 2024 at 11:35 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:
> >
> > > You're right. I didn't mean to say that Connections and Datasets facilitate the same thing - they don't. I meant that Connections are also "useless" if no task is using that Connection - but we allow them to be created independently of dags. From that angle, I don't see how allowing Datasets to be created independently is any different.
> > >
> > > Also happy to hear from others about this.
> > >
> > > On Wed, Jan 24, 2024 at 1:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > I'd love to hear what others think - especially those who are involved in dataset creation and discussion more than me. I personally believe that conceptually connections and datasets are as far from each other as possible. I have no idea where the similarity between connections (which are essentially static configuration of credentials) and datasets (which are a dynamic reflection of data being passed live between tasks) comes from. The only similarity I see is that they are both stored by Airflow in some table (and not even that if you use SecretsManager). So comparing those two is an apple-to-pear comparison if you ask me.
> > > >
> > > > But (despite my 4 years' experience of creating Airflow) my actual experience with Datasets is limited - I've been mainly observing what was going on - so I would love to hear from those who created (and continue to think about the future of) the datasets :).
> > > >
> > > > J,
> > > >
> > > > On Wed, Jan 24, 2024 at 7:27 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:
> > > >
> > > > > Right. That is why I was trying to make a distinction in the PR and in this discussion between CRUD-ing Dataset Objects/Definitions vs creating and deleting Dataset Events from the queue. Happy to standardize on whatever terminology makes sure things are understood so we can have a productive conversation.
> > > > >
> > > > > For Dataset Events - creating, reading and deleting them via API is IMHO not controversial.
> > > > > - For creating: This has been discussed in various places, and the endpoint could be used to trigger dependent dags.
> > > > > - For deleting: It is easy for DAGs with multiple upstream dependencies to go out of sync, and there is no way to recover from that without manipulating the DB directly. See here <https://github.com/apache/airflow/discussions/36618> and here <https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>.
> > > > >
> > > > > For CRUD-ing Dataset Definitions via API:
> > > > >
> > > > > > IMHO Airflow should only manage its own entities and at most it should emit events (dataset listeners, openlineage API) to inform others about state changes of things that Airflow manages, but it should not be abused to store "other" datasets that Airflow DAGs know nothing about.
> > > > >
> > > > > I disagree that it is an abuse.
> > > > > If I, as an internal data producer, publish a dataset that I expect internal Airflow users to use, it is not abusing Airflow to create a dataset and make it visible in Airflow. At some point in the near future, users will start referencing it in their dags - it's just a sequencing question. We don't enforce connections being tied to a dag - and conceptually, this is no different. It is also no different than adding the definition as part of a dag file and having that dataset show up in the dataset list, without forcing it to be a task output as part of a dag. The only valid reason to not allow it IMHO is that datasets were designed to be defined within a dag file, similar to a dag, and we don't want to deal with the impediment I laid out.
> > > > >
> > > > > On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > > > >
> > > > > > On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:
> > > > > >
> > > > > > > I also think it makes sense to allow people to create/update/delete Datasets via the API and eventually UI. Even if the dataset is not initially connected to a DAG, it's nice to be able to see in one place all the datasets and ML models that my dags can leverage. We allow people to create Connections and Variables via the API and UI without forcing users to use them as part of a task or dag. This isn't any different in that respect.
> > > > > > >
> > > > > > > > Airflow has some objects that can be created by a dag processor (Dags, Datasets) and others that can be created with API/UI (Connections, Variables)
> > > > > >
> > > > > > A comment from my side. I think there is a big conceptual difference here that you yourself noticed - DAG code, via the DagProcessor, creates DAGs and Datasets, while the UI/API can create and modify Connections/Variables that are then USED (but never created) by DAG code. This is why, while I see no fundamental security blocker with "creating" Datasets via the API, it definitely feels out of place to be able to manage them via the API.
> > > > > >
> > > > > > And following the discussion from the PR - yes, we should discuss create, update and delete separately, because conceptually they are NOT typical CRUD (which the Connection/Variable API and UI are). I think there is a huge difference between "updating" and "deleting" datasets via the API and the `UD` in CRUD:
> > > > > >
> > > > > > * Updating a dataset does not actually "update" its definition; it informs those who listen on the dataset that it has changed. No more, no less. Typically when you have a CRUD operation, you pass the same data in "C" and "U" - but in our case those two operations are different and serve different purposes.
> > > > > > * Deleting a dataset is also not what the "D" in CRUD is - in this case it is mostly "retention".
> > > > > > And there are some very specific things here. Should we delete a dataset that some of the DAGs still have as input/output? IMHO - absolutely not. But... how do we know that? If we have only DAGs implicitly creating Datasets by declaring them, we can easily know that by reference counting. But when we allow the creation of datasets via the API, it's no longer that obvious and the number of cases to handle gets really big.
> > > > > >
> > > > > > After seeing the comments and discussion, I believe it's not a good idea to allow external Dataset creation; the use case does not justify it IMHO.
> > > > > >
> > > > > > Why?
> > > > > >
> > > > > > We do not want Airflow to become a "dataset metadata storage" that you can query/update to find out all the kinds of datasets your whole <data lake> has - this is not the purpose of Airflow, and never will be IMHO. It's a non-goal for Airflow to keep "other" datasets.
> > > > > >
> > > > > > IMHO Airflow should only manage its own entities and at most emit events (dataset listeners, openlineage API) to inform others about state changes of the things Airflow manages; it should not be abused to store "other" datasets that Airflow DAGs know nothing about. That would, in a way, contradict our "Airflow as a Platform" approach and the whole concept of the OpenLineage integration in Airflow. If you want a single place where all the datasets you manage are stored, have all your components emit OpenLineage events and use a dedicated solution (Marquez, Amundsen, Google Data Catalog, etc.) - all of the serious ones now consume OpenLineage events, which pretty much all serious components already emit - and there you can have it all. This is our strategic direction, and this is why we accepted AIP-53 OpenLineage:
> > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > > > > > At the moment we accepted it, we also accepted the fact that Airflow is just a producer of lineage data, not a storage for it nor a consumer of it - because this is the scope of AIP-53.
> > > > > >
> > > > > > I think the only way a dataset should be created in the Airflow DB is via the DagFileProcessor - eventually with reference counting, and possibly removal of datasets that are no longer used by anyone, if we decide we do not want to keep old datasets in the DB. That should be it IMHO.
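To make the reference-counting idea concrete, here is a sketch using the dataset reference tables Airflow already maintains (the imports are real Airflow 2.4+ models; the helper function itself is hypothetical). A dataset created only via an API would have zero consumer and producer references, which is exactly the ambiguity described above.

    from airflow.models.dataset import (
        DagScheduleDatasetReference,
        DatasetModel,
        TaskOutletDatasetReference,
    )
    from airflow.utils.session import create_session

    def dataset_is_referenced(uri: str) -> bool:
        # A dataset is "in use" if at least one DAG schedules on it
        # (consumer) or one task declares it as an outlet (producer).
        with create_session() as session:
            ds = (
                session.query(DatasetModel)
                .filter(DatasetModel.uri == uri)
                .one_or_none()
            )
            if ds is None:
                return False
            consumers = (
                session.query(DagScheduleDatasetReference)
                .filter_by(dataset_id=ds.id)
                .count()
            )
            producers = (
                session.query(TaskOutletDatasetReference)
                .filter_by(dataset_id=ds.id)
                .count()
            )
            return consumers + producers > 0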
> > > --
> > > Constance Martineau
> > > Senior Product Manager
> > >
> > > Email: consta...@astronomer.io
> > > Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
> > >
> > > <https://www.astronomer.io/>
> >
> > --
> > Respectfully,
> > Tornike Gurgenidze,
> > third-year ESM student, Group XI