I am also pretty cool with adding/updating datasets externally. However, I know there are some ongoing discussions on how to improve/change datasets and bind them together with multiple other features of Airflow - I am not sure what the state of those is, but it would be great if those efforts were coordinated so that we are not pulling in multiple directions.
From what I've heard/overheard, the things currently in flight around Datasets are:

* AIP-60 - https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-60+Standard+URI+representation+for+Airflow+Datasets - already almost passed
* Better coupling of datasets with OpenLineage
* Partial datasets - allowing datasets to have data intervals
* Triggering DAGs on external dataset changes
* Object Storage integration with datasets

All of which sound very promising and are definitely important for Dataset usage. So I think we should make sure that when we do anything with datasets, the people who think about and work on the aspects above have a say in those proposals/discussions - it would be a shame to add something that partially invalidates some of the other things, or makes them terribly complex to implement. I am not saying that is the case here; I am just saying that we should at least make sure the people who are currently thinking about these things are not surprised if we merge something that makes their job harder.

I am a little surprised - knowing the *thinking* happening in the dataset area that I am aware of - that there are so few comments on this one (even a "hey, looks cool - it works well for the things I am thinking about" would be welcome) :).

J.

On Tue, Jan 23, 2024 at 3:53 AM Ryan Hatter <ryan.hat...@astronomer.io.invalid> wrote:

> I don't think it makes sense to include the create endpoint without also
> including dataset update and delete endpoints and updating the Datasets
> view in the UI to be able to manage externally created Datasets.
>
> With that said, I don't think the fact that Datasets are tightly coupled
> with DAGs is a good reason not to include additional Dataset endpoints. It
> makes sense to me to be able to interact with Datasets from outside of
> Airflow.
>
> On Sat, Jan 20, 2024 at 6:13 AM Eduardo Nicastro <edu.nicas...@gmail.com>
> wrote:
>
> > Hello all, I have created a pull request
> > (https://github.com/apache/airflow/pull/36929) to make it possible to
> > create a dataset through the API as a modest step forward. This PR is
> > open for your feedback. I'm preparing another PR to build upon the
> > insights from https://github.com/apache/airflow/pull/29433. Your
> > thoughts and contributions are highly encouraged.
> >
> > Best Regards,
> > Eduardo Nicastro
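For illustration only, a call to the kind of create endpoint that PR proposes could look roughly like the sketch below - the path, payload, and auth here are assumptions, not the actual contract from apache/airflow#36929:

    # Sketch only: endpoint path, payload shape and auth are assumptions,
    # not the contract proposed in apache/airflow#36929.
    import requests

    TENANT_2_API = "https://tenant-2.example.com/api/v1"  # hypothetical peer instance

    resp = requests.post(
        f"{TENANT_2_API}/datasets",                    # hypothetical create endpoint
        json={"uri": "s3://warehouse/bronze/orders"},  # a Dataset is identified by its URI
        auth=("api_user", "api_password"),             # assuming basic auth is enabled
        timeout=10,
    )
    resp.raise_for_status()

In a multi-tenant setup, something like this is presumably what a sync job or a listener (see the sketch at the end of this mail) would call against each peer instance.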
> > On Thu, Jan 11, 2024 at 4:30 PM Eduardo Nicastro <edu.nicas...@gmail.com>
> > wrote:
> >
> >> Hello all,
> >>
> >> I'm reaching out to propose a topic for discussion that has recently
> >> emerged in our GitHub discussion threads (#36723
> >> <https://github.com/apache/airflow/discussions/36723>). It revolves
> >> around enhancing the management of datasets in a multi-tenant Airflow
> >> architecture.
> >>
> >> Use case/motivation
> >> In our multi-instance setup, synchronizing dataset dependencies across
> >> instances poses significant challenges. With the advent of dataset
> >> listeners, a new door has opened for cross-instance dataset awareness. I
> >> propose we explore creating endpoints to export dataset updates, making it
> >> possible to trigger DAGs consuming a Dataset across tenants.
> >>
> >> Context
> >> Below I give some context about our current situation and the solution we
> >> have in place, and propose a new workflow that would be more efficient. To
> >> implement this new workflow, we would need a way to export Dataset
> >> updates, as mentioned.
> >>
> >> Current Workflow
> >> In our organization, we're dealing with multiple Airflow tenants - let's
> >> say Tenant 1 and Tenant 2, as examples. To synchronize Dataset A across
> >> these tenants, we currently have a complex setup:
> >>
> >> 1. Containers run on a schedule to export metadata to CosmosDB (these
> >> will be replaced by the listener).
> >> 2. Additional scheduled containers pull data from CosmosDB and write it
> >> to a shared file system, enabling generated DAGs to read it and mirror a
> >> dataset across tenants.
> >>
> >> Proposed Workflow
> >> Here's a breakdown of our proposed workflow:
> >>
> >> 1. Cross-Tenant Dataset Interaction: We have DAGs in Tenant 1 producing
> >> Dataset A. We need a mechanism to trigger all DAGs consuming Dataset A in
> >> Tenant 2. This interaction is crucial for our data pipeline's efficiency
> >> and synchronicity.
> >> 2. Dataset Listener Implementation: Our approach involves implementing a
> >> Dataset listener that programmatically creates Dataset A in all tenants
> >> where it's not present (like Tenant 2) and exports Dataset updates when
> >> they happen. This would trigger an update on all DAGs consuming from that
> >> Dataset.
> >> 3. Standardized Dataset Names: We plan to use standardized dataset names,
> >> which makes sense since a URI is a Dataset's identifier and uniqueness is
> >> a logical requirement.
> >>
> >> [image: image.png]
> >>
> >> Why This Matters:
> >>
> >> - It offers a streamlined, automated way to manage datasets across
> >> different Airflow instances.
> >> - It aligns with the need for efficient, interconnected workflows in a
> >> multi-tenant environment.
> >>
> >> I invite the community to discuss:
> >>
> >> - Are there alternative methods within Airflow's current framework that
> >> could achieve similar goals?
> >> - Any insights or experiences that could inform our approach?
> >>
> >> Your feedback and suggestions are invaluable, and I look forward to a
> >> collaborative discussion.
> >>
> >> Best Regards,
> >> Eduardo Nicastro
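Coming back to the "Dataset Listener Implementation" step in the proposed workflow above: a minimal sketch of what such a listener could look like, assuming the dataset listener hooks that shipped in Airflow 2.8 (on_dataset_created / on_dataset_changed) and a hypothetical "export update" endpoint on the peer tenants - the peer URLs and that endpoint are made up for illustration, and auth, retries and deduplication are deliberately left out:

    # listeners/cross_tenant.py - sketch only; peer URLs and the
    # /datasets/events endpoint are hypothetical, not an existing Airflow API.
    import requests

    from airflow.datasets import Dataset
    from airflow.listeners import hookimpl

    PEER_TENANTS = ["https://tenant-2.example.com/api/v1"]  # hypothetical peer instances


    @hookimpl
    def on_dataset_created(dataset: Dataset):
        """Mirror a newly registered Dataset to the peer tenants."""
        for base_url in PEER_TENANTS:
            requests.post(
                f"{base_url}/datasets",          # hypothetical create endpoint
                json={"uri": dataset.uri},       # auth omitted for brevity
                timeout=10,
            )


    @hookimpl
    def on_dataset_changed(dataset: Dataset):
        """Forward a local Dataset update so consuming DAGs on the peers are triggered."""
        for base_url in PEER_TENANTS:
            requests.post(
                f"{base_url}/datasets/events",   # hypothetical "export update" endpoint
                json={"dataset_uri": dataset.uri},
                timeout=10,
            )


    # plugins/cross_tenant_plugin.py - registers the listener so the scheduler picks it up.
    from airflow.plugins_manager import AirflowPlugin

    from listeners import cross_tenant


    class CrossTenantDatasetPlugin(AirflowPlugin):
        name = "cross_tenant_dataset_plugin"
        listeners = [cross_tenant]

Error handling and some way to keep two tenants from ping-ponging the same event back and forth would need to be part of any real implementation; the sketch only shows the shape of the idea.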