Hello all,

I have created a pull request (https://github.com/apache/airflow/pull/36929) that makes it possible to create a dataset through the API, as a modest step forward. The PR is open for your feedback. I'm also preparing another PR that builds on the insights from https://github.com/apache/airflow/pull/29433. Your thoughts and contributions are highly encouraged.
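For illustration, a call against the endpoint proposed in that PR could look like the sketch below. The POST route, payload shape, and auth are my assumptions (today the stable REST API only exposes read-only GET /api/v1/datasets endpoints), not the final merged API:

    import requests

    # Hypothetical call against the create-dataset endpoint proposed in
    # PR #36929; the route and payload are assumptions, not the final API.
    AIRFLOW_URL = "https://tenant1.example.com"  # placeholder host

    resp = requests.post(
        f"{AIRFLOW_URL}/api/v1/datasets",
        json={"uri": "s3://bucket/dataset-a"},  # the dataset's URI/identifier
        auth=("user", "pass"),                  # basic auth, for example only
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())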
Best Regards,
Eduardo Nicastro

On Thu, Jan 11, 2024 at 4:30 PM Eduardo Nicastro <edu.nicas...@gmail.com> wrote:

> Hello all,
>
> I'm reaching out to propose a topic for discussion that recently
> emerged in our GitHub discussion thread #36723
> (https://github.com/apache/airflow/discussions/36723). It revolves
> around enhancing the management of datasets in a multi-tenant Airflow
> architecture.
>
> Use case/motivation
> In our multi-instance setup, synchronizing dataset dependencies across
> instances poses significant challenges. With the advent of dataset
> listeners, a new door has opened for cross-instance dataset awareness.
> I propose we explore creating endpoints to export dataset updates,
> making it possible to trigger DAGs that consume a Dataset across
> tenants.
>
> Context
> Below I give some context on our current situation and the solution we
> have in place, then propose a new workflow that would be more
> efficient. To implement this new workflow, we would need a way to
> export Dataset updates, as mentioned above.
>
> Current Workflow
> In our organization we are dealing with multiple Airflow tenants, say
> Tenant 1 and Tenant 2. To synchronize Dataset A across these tenants,
> we currently have a complex setup:
>
> 1. Containers run on a schedule to export metadata to CosmosDB (these
>    will be replaced by the listener).
> 2. Additional scheduled containers pull data from CosmosDB and write
>    it to a shared file system, enabling generated DAGs to read it and
>    mirror a dataset across tenants.
>
> Proposed Workflow
> Here's a breakdown of our proposed workflow:
>
> 1. Cross-Tenant Dataset Interaction: We have DAGs in Tenant 1
>    producing Dataset A. We need a mechanism to trigger all DAGs
>    consuming Dataset A in Tenant 2. This interaction is crucial for
>    our data pipeline's efficiency and synchronicity.
> 2. Dataset Listener Implementation: Our approach involves implementing
>    a Dataset listener that programmatically creates Dataset A in all
>    tenants where it is not yet present (such as Tenant 2) and exports
>    Dataset updates when they happen. Each exported update would then
>    trigger all DAGs consuming that Dataset.
> 3. Standardized Dataset Names: We plan to use standardized dataset
>    names, which makes sense since a Dataset's URI is its identifier
>    and uniqueness across tenants is a logical requirement.
>
> [image: image.png]
>
> Why This Matters:
>
> - It offers a streamlined, automated way to manage datasets across
>   different Airflow instances.
> - It aligns with the need for efficient, interconnected workflows in a
>   multi-tenant environment.
>
> I invite the community to discuss:
>
> - Are there alternative methods within Airflow's current framework
>   that could achieve similar goals?
> - Any insights or experiences that could inform our approach?
>
> Your feedback and suggestions are invaluable, and I look forward to a
> collaborative discussion.
>
> Best Regards,
> Eduardo Nicastro
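For anyone who wants something concrete to react to, here is a minimal sketch of the listener idea from point 2 of the quoted proposal, built on the on_dataset_created / on_dataset_changed listener hooks added in Airflow 2.8. The peer-tenant URL and the two endpoints it posts to are hypothetical (they correspond to what PRs #36929 and #29433 propose), so treat this as a sketch of where the cross-tenant export would hook in, not working code:

    # plugins/cross_tenant_dataset_listener.py
    import requests

    from airflow.datasets import Dataset
    from airflow.listeners import hookimpl
    from airflow.plugins_manager import AirflowPlugin

    # Hypothetical peer tenants that should mirror dataset activity.
    PEER_TENANTS = ["https://tenant2.example.com"]


    class CrossTenantDatasetListener:
        """Forwards local dataset activity to all peer tenants."""

        @hookimpl
        def on_dataset_created(self, dataset: Dataset):
            # Mirror the dataset definition in tenants where it does
            # not exist yet (hypothetical create endpoint, cf. #36929).
            for base_url in PEER_TENANTS:
                requests.post(
                    f"{base_url}/api/v1/datasets",
                    json={"uri": dataset.uri},
                    timeout=10,
                )

        @hookimpl
        def on_dataset_changed(self, dataset: Dataset):
            # Export the update so DAGs consuming the dataset in other
            # tenants get triggered (hypothetical endpoint, cf. #29433).
            for base_url in PEER_TENANTS:
                requests.post(
                    f"{base_url}/api/v1/datasets/events",
                    json={"dataset_uri": dataset.uri},
                    timeout=10,
                )


    class CrossTenantDatasetPlugin(AirflowPlugin):
        name = "cross_tenant_dataset_plugin"
        listeners = [CrossTenantDatasetListener()]

On the consuming side nothing special is needed: a DAG in Tenant 2 scheduled with schedule=[Dataset("s3://bucket/dataset-a")] is triggered once the update event is recorded there, which is why the standardized URIs in point 3 matter.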