Hello all, I have opened a Pull Request (
https://github.com/apache/airflow/pull/36929) that makes it possible to
create a dataset through the API, as a modest first step. The PR is open
for your feedback. I'm also preparing a follow-up PR that builds on the
insights from https://github.com/apache/airflow/pull/29433. Your thoughts
and contributions are highly encouraged.
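
For illustration, creating a dataset through the API could then look
roughly like this (a minimal Python sketch; the endpoint path, payload,
and credentials are assumptions for illustration, not the final API):

    import requests

    # Hypothetical endpoint shape; the final path and payload may differ
    # from whatever ships once the PR is merged.
    resp = requests.post(
        "http://tenant-1.example.com/api/v1/datasets",
        auth=("user", "password"),
        json={"uri": "s3://bucket/dataset-a"},
    )
    resp.raise_for_status()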

Best Regards,
Eduardo Nicastro

On Thu, Jan 11, 2024 at 4:30 PM Eduardo Nicastro <edu.nicas...@gmail.com>
wrote:

> Hello all,
>
> I'm reaching out to propose a topic for discussion that has recently
> emerged in our GitHub discussion threads (#36723
> <https://github.com/apache/airflow/discussions/36723>). It revolves
> around enhancing the management of datasets in a multi-tenant Airflow
> architecture.
>
> Use case/motivation
> In our multi-instance setup, synchronizing dataset dependencies across
> instances poses significant challenges. With the advent of dataset
> listeners, a new door has opened for cross-instance dataset awareness. I
> propose we explore adding endpoints for exporting dataset updates, making
> it possible to trigger DAGs that consume a Dataset across tenants.
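>
> For context, the consuming side already works within a single instance
> today: a DAG scheduled on a Dataset runs whenever that Dataset receives
> an update. A minimal sketch using standard dataset scheduling (the URI
> and DAG names are just examples):
>
>     from pendulum import datetime
>
>     from airflow.datasets import Dataset
>     from airflow.decorators import dag, task
>
>     dataset_a = Dataset("s3://bucket/dataset-a")
>
>     # In Tenant 2: runs whenever an update to Dataset A is recorded
>     @dag(schedule=[dataset_a], start_date=datetime(2024, 1, 1),
>          catchup=False)
>     def consumer_of_dataset_a():
>         @task
>         def process():
>             ...
>
>         process()
>
>     consumer_of_dataset_a()
>
> What is missing is a way for Tenant 1 to record that update in Tenant
> 2's instance; that is the gap the proposed export endpoints would close.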
>
> Context
> Below I give some context on our current setup and the solution we have
> in place, then propose a new workflow that would be more efficient. To
> implement this new workflow, we would need a way to export Dataset
> updates, as mentioned above.
>
> Current Workflow
> In our organization, we run multiple Airflow tenants; take Tenant 1 and
> Tenant 2 as examples. To synchronize Dataset A across these tenants, we
> currently have a complex setup:
>
>    1. Containers run on a schedule to export metadata to CosmosDB (these
>    will be replaced by the listener).
>    2. Additional scheduled containers pull data from CosmosDB and write it
>    to a shared file system, enabling generated DAGs to read it and mirror a
>    dataset across tenants.
>
>
> Proposed Workflow
> Here's a breakdown of our proposed workflow:
>
>    1. Cross-Tenant Dataset Interaction: We have DAGs in Tenant 1
>    producing Dataset A. We need a mechanism to trigger all DAGs consuming
>    Dataset A in Tenant 2. This interaction is crucial for our data pipeline's
>    efficiency and synchronicity.
>    2. Dataset Listener Implementation: Our approach involves implementing
>    a Dataset listener that programmatically creates Dataset A in all tenants
>    where it's not present (like Tenant 2) and exports Dataset updates when
>    they happen. This would trigger all DAGs consuming that Dataset; see the
>    sketch after this list.
>    3. Standardized Dataset Names: We plan to use standardized dataset
>    names, which makes sense since a Dataset's URI is its identifier and
>    cross-tenant uniqueness is a logical requirement.
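>
> To make step 2 concrete, here is a minimal sketch of such a listener,
> assuming the on_dataset_changed listener hook available since Airflow
> 2.8; the peer-tenant URL and the receiving endpoint are hypothetical
> placeholders for the proposed export API:
>
>     import requests
>
>     from airflow.datasets import Dataset
>     from airflow.listeners import hookimpl
>
>     # Hypothetical: base URLs of the other tenants' API servers
>     PEER_TENANTS = ["http://tenant-2.example.com"]
>
>     @hookimpl
>     def on_dataset_changed(dataset: Dataset):
>         """Forward a local Dataset update to every peer tenant."""
>         for base_url in PEER_TENANTS:
>             # Hypothetical endpoint: this is what the proposed
>             # "export dataset updates" API would need to expose.
>             requests.post(
>                 f"{base_url}/api/v1/datasets/events",
>                 auth=("user", "password"),
>                 json={"dataset_uri": dataset.uri},
>                 timeout=10,
>             )
>
> (The module would be registered as a listener via an Airflow plugin.)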
>
> Why This Matters:
>
>    - It offers a streamlined, automated way to manage datasets across
>    different Airflow instances.
>    - It aligns with a need for efficient, interconnected workflows in a
>    multi-tenant environment.
>
>
> I invite the community to discuss:
>
>    - Are there alternative methods within Airflow's current framework
>    that could achieve similar goals?
>    - Any insights or experiences that could inform our approach?
>
> Your feedback and suggestions are invaluable, and I look forward to a
> collaborative discussion.
>
> Best Regards,
> Eduardo Nicastro
>
