I also think it makes sense to allow people to create/update/delete
Datasets via the API and eventually UI. Even if the dataset is not
initially connected to a DAG, it's nice to be able to see in one place all
the datasets and ML models that my dags can leverage. We allow people to
create Connections and Variables via the API and UI without forcing users
to use them as part of a task or dag. This isn't any different in that
respect.
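
For illustration, this is roughly what I imagine the "create" side
looking like from outside Airflow (a sketch only - the endpoint path and
payload here are hypothetical, along the lines of what the open PR
proposes rather than an existing API, and it assumes basic auth is
enabled on the webserver):

import requests

# Hypothetical shape of a "create dataset" call against the REST API.
resp = requests.post(
    "http://localhost:8080/api/v1/datasets",
    json={"uri": "s3://my-bucket/ml-models/churn/v1"},  # URI = identifier
    auth=("admin", "admin"),  # assumes basic auth is enabled
    timeout=10,
)
resp.raise_for_status()
print(resp.json())

Nothing about that requires the dataset to be attached to a dag up front.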

> Airflow has some objects that can
> be created by a dag processor (Dags, Datasets) and others that can be
> created with API/UI (Connections, Variables)


@Tornike Gurgenidze <togur...@freeuni.edu.ge> brings up a valid point
though: how would we handle changes coming from the API or UI for datasets
that are defined via a dag file? The difference, afaik, is that if I choose
to define a connection or variable via a dag file, I have to create a
session and explicitly save it to the DB; merely instantiating a Connection
or Variable object does nothing, whereas a Dataset only has to be
referenced in the dag for the dag processor to register it.
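
Roughly, for illustration (a sketch, not code from a real dag):

from airflow.datasets import Dataset
from airflow.models import Connection
from airflow.utils.session import create_session

# A Connection instance does nothing until I explicitly persist it myself.
conn = Connection(conn_id="my_api", conn_type="http", host="example.com")
with create_session() as session:
    session.add(conn)  # deliberate, explicit DB write from dag code

# A Dataset only needs to be referenced (e.g. as a task outlet) and the
# dag processor registers it in the metadata DB when the file is parsed.
orders = Dataset("s3://my-bucket/processed/orders")

So for Connections and Variables the DB write is something I opt into
explicitly, whereas dag-defined Datasets are (re)created by the dag
processor on every parse - which is exactly why API/UI edits to them are
the awkward case.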

On Tue, Jan 23, 2024 at 8:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Clarifying: there is no problem (and there never has been one) with
> opening up the submission of "structured" DAGs.
>
> On Tue, Jan 23, 2024 at 2:12 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > > I always assumed that this was the reason why it's impossible to create
> > > dags from the API; no one wanted to open this particular can of worms.
> > > I think if you need to synchronize these objects, the cleaner way would
> > > be to describe them in some sort of a shared config file and let
> > > respective dag-processors create them independently of each other.
> >
> > Just to clarify this one: creating DAGs via the API has been resented
> > mostly because of security reasons - where you would want to submit
> > Python DAG code via the API. There is no problem (and there never has
> > been one) with opening up the submission of "structured" DAGs. This has
> > never been implemented, but if you limit it to just modifying or
> > creating the resulting DAG structure, that would be possible - for
> > example there is no fundamental problem with generating a DAG from
> > (say) a visual representation and submitting the resulting DAG
> > structure without creating a DAG Python file (so essentially playing
> > the role of the DAG file processor and serializing DAGs). It would have
> > a number of limitations (for example callbacks would not work,
> > timetables would be a challenge etc.), but other than that it's quite
> > possible (and we might even have something like that in the future).
> >
> > Following that - there are no fundamental problems with submitting
> > datasets - because they are not Python code, they are pure "metadata"
> > objects.
> >
> > Still, the question of how this plays with DAG-created datasets remains
> > an important aspect of the proposal.
> >
> > J.
> >
> >
> > On Tue, Jan 23, 2024 at 2:01 PM Tornike Gurgenidze <
> > togur...@freeuni.edu.ge> wrote:
> >
> >> Maybe I'm missing something, but I can't see how REST endpoints for
> >> datasets could work in practice. Afaik, Airflow has some objects that
> >> can be created by a dag processor (Dags, Datasets) and others that can
> >> be created with API/UI (Connections, Variables), but never both at the
> >> same time. How would update/delete endpoints work if a Dataset was
> >> initially created declaratively from a dag file? Would it throw an
> >> exception, or make an update that will then be reverted a little while
> >> later by a dag-processor anyway?
> >>
> >> I always assumed that this was the reason why it's impossible to create
> >> dags from the API; no one wanted to open this particular can of worms.
> >> I think if you need to synchronize these objects, the cleaner way would
> >> be to describe them in some sort of a shared config file and let
> >> respective dag-processors create them independently of each other.
> >>
> >> On Tue, Jan 23, 2024 at 4:02 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >>
> >> > I am also pretty cool with adding/updating/deleting datasets
> >> > externally, however I know there are some ongoing discussions on how
> >> > to improve/change datasets and bind them together with multiple other
> >> > features of Airflow - I'm not sure what the state of those is, but it
> >> > would be great if those efforts were coordinated so that we are not
> >> > pulling stuff in multiple directions.
> >> >
> >> > From what I've heard/overheard/noticed about Datasets, those are the
> >> > things:
> >> >
> >> > * AIP-60 -
> >> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-60+Standard+URI+representation+for+Airflow+Datasets
> >> > - already almost passed
> >> > * Better coupling of datasets with OpenLineage
> >> > * Partial datasets - allowing to have datasets with data intervals
> >> > * Triggering dags on external dataset changes
> >> > * Objects Storage integration with datasets
> >> >
> >> > All of which sound very promising and are definitely important for
> >> > Dataset usage.
> >> >
> >> > So I think we should really make sure that when we are doing anything
> >> > with datasets, the people who think/work on those aspects above have
> >> > a say in those proposals/discussions - it would be a shame if we
> >> > added something that partially invalidates, or makes it terribly
> >> > complex to implement, some of the other things.
> >> >
> >> > I am not saying it's the case here, I am just saying that we should
> >> > at least make sure that people who are currently thinking about these
> >> > things aren't surprised if we merge something that will make their
> >> > job harder.
> >> >
> >> > I am a little surprised - knowing the *thinking* happening in the
> >> > dataset area that I am aware of - that there are so few comments on
> >> > this one (even just a "hey, looks cool - works well for the things I
> >> > am thinking about") :).
> >> >
> >> > J.
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, Jan 23, 2024 at 3:53 AM Ryan Hatter
> >> > <ryan.hat...@astronomer.io.invalid> wrote:
> >> >
> >> > > I don't think it makes sense to include the create endpoint without
> >> > > also including dataset update and delete endpoints and updating the
> >> > > Datasets view in the UI to be able to manage externally created
> >> > > Datasets.
> >> > >
> >> > > With that said, I don't think the fact that Datasets are tightly
> >> > > coupled with DAGs is a good reason not to include additional Dataset
> >> > > endpoints. It makes sense to me to be able to interact with Datasets
> >> > > from outside of Airflow.
> >> > >
> >> > > On Sat, Jan 20, 2024 at 6:13 AM Eduardo Nicastro
> >> > > <edu.nicas...@gmail.com> wrote:
> >> > >
> >> > > > Hello all, I have created a Pull Request (
> >> > > > https://github.com/apache/airflow/pull/36929) to make it possible
> >> > > > to create a dataset through the API as a modest step forward. This
> >> > > > PR is open for your feedback. I'm preparing another PR to build
> >> > > > upon the insights from
> >> > > > https://github.com/apache/airflow/pull/29433. Your thoughts and
> >> > > > contributions are highly encouraged.
> >> > > >
> >> > > > Best Regards,
> >> > > > Eduardo Nicastro
> >> > > >
> >> > > > On Thu, Jan 11, 2024 at 4:30 PM Eduardo Nicastro
> >> > > > <edu.nicas...@gmail.com> wrote:
> >> > > >
> >> > > >> Hello all,
> >> > > >>
> >> > > >> I'm reaching out to propose a topic for discussion that has
> >> > > >> recently emerged in our GitHub discussion threads (#36723
> >> > > >> <https://github.com/apache/airflow/discussions/36723>). It
> >> > > >> revolves around enhancing the management of datasets in a
> >> > > >> multi-tenant Airflow architecture.
> >> > > >>
> >> > > >> Use case/motivation
> >> > > >> In our multi-instance setup, synchronizing dataset dependencies
> >> > > >> across instances poses significant challenges. With the advent of
> >> > > >> dataset listeners, a new door has opened for cross-instance
> >> > > >> dataset awareness. I propose we explore creating endpoints to
> >> > > >> export dataset updates to make it possible to trigger DAGs
> >> > > >> consuming from a Dataset across tenants.
> >> > > >>
> >> > > >> Context
> >> > > >> Below I will give some context about our current situation and
> >> > > >> the solution we have in place, and propose a new workflow that
> >> > > >> would be more efficient. To be able to implement this new
> >> > > >> workflow we would need a way to export Dataset updates, as
> >> > > >> mentioned.
> >> > > >>
> >> > > >> Current Workflow
> >> > > >> In our organization, we're dealing with multiple Airflow
> >> > > >> tenants, let's say Tenant 1 and Tenant 2, as examples. To
> >> > > >> synchronize Dataset A across these tenants, we currently have a
> >> > > >> complex setup:
> >> > > >>
> >> > > >>    1. Containers run on a schedule to export metadata to CosmosDB
> >> > > >>    (these will be replaced by the listener).
> >> > > >>    2. Additional scheduled containers pull data from CosmosDB and
> >> > > >>    write it to a shared file system, enabling generated DAGs to
> >> > > >>    read it and mirror a dataset across tenants.
> >> > > >>
> >> > > >>
> >> > > >> Proposed Workflow
> >> > > >> Here's a breakdown of our proposed workflow:
> >> > > >>
> >> > > >>    1. Cross-Tenant Dataset Interaction: We have Dags in Tenant 1
> >> > > >>    producing Dataset A. We need a mechanism to trigger all Dags
> >> > > >>    consuming Dataset A in Tenant 2. This interaction is crucial
> >> > > >>    for our data pipeline's efficiency and synchronicity.
> >> > > >>    2. Dataset Listener Implementation: Our approach involves
> >> > > >>    implementing a Dataset listener that programmatically creates
> >> > > >>    Dataset A in all tenants where it's not present (like Tenant
> >> > > >>    2) and exports Dataset updates when they happen. This would
> >> > > >>    trigger an update on all Dags consuming from that Dataset
> >> > > >>    (see the rough listener sketch below).
> >> > > >>    3. Standardized Dataset Names: We plan to use standardized
> >> > > >>    dataset names, which makes sense since a URI is a dataset's
> >> > > >>    identifier and uniqueness is a logical requirement.
> >> > > >>
> >> > > >> [image: image.png]
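> >> > > >>
> >> > > >> As a rough illustration of point 2 (a sketch only - it relies on
> >> > > >> the dataset listener hooks recently added to Airflow, while the
> >> > > >> "export" endpoint on the other tenant is hypothetical, i.e.
> >> > > >> whatever we decide to expose there):
> >> > > >>
> >> > > >> # dataset_sync_listener.py - registered through an Airflow
> >> > > >> # plugin's "listeners" attribute; sketch only.
> >> > > >> import requests
> >> > > >>
> >> > > >> from airflow.datasets import Dataset
> >> > > >> from airflow.listeners import hookimpl
> >> > > >>
> >> > > >> @hookimpl
> >> > > >> def on_dataset_changed(dataset: Dataset):
> >> > > >>     # Forward the update so the other tenant can trigger the Dags
> >> > > >>     # consuming this Dataset. The URL below is hypothetical.
> >> > > >>     requests.post(
> >> > > >>         "https://tenant-2.example.com/api/v1/datasets/events",
> >> > > >>         json={"dataset_uri": dataset.uri},
> >> > > >>         timeout=10,
> >> > > >>     )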
> >> > > >>
> >> > > >> Why This Matters:
> >> > > >>
> >> > > >>    - It offers a streamlined, automated way to manage datasets
> >> > > >>    across different Airflow instances.
> >> > > >>    - It aligns with a need for efficient, interconnected
> >> > > >>    workflows in a multi-tenant environment.
> >> > > >>
> >> > > >>
> >> > > >> I invite the community to discuss:
> >> > > >>
> >> > > >>    - Are there alternative methods within Airflow's current
> >> > > >>    framework that could achieve similar goals?
> >> > > >>    - Any insights or experiences that could inform our approach?
> >> > > >>
> >> > > >> Your feedback and suggestions are invaluable, and I look forward
> >> > > >> to a collaborative discussion.
> >> > > >>
> >> > > >> Best Regards,
> >> > > >> Eduardo Nicastro
> >> > > >>
> >> > > >
> >> > >
> >> >
> >>
> >
>


-- 

Constance Martineau
Senior Product Manager

Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)


<https://www.astronomer.io/>
