Hey, sorry for the late reply. I'm working on a feature that also touches scheduling based on Datasets, which I'd like to expand on in a future email, but here I want to define what the database model of a Dataset is, and I agree with Jarek and Tornike.
It seems to me that there are generally two ways of managing state - declarative and imperative. Currently, Airflow marries the two: it manages objects like DAGs or Datasets declaratively, while DagRuns or DatasetEvents, for example, are imperative. The difference is that DAGs and Datasets reflect underlying state described, in this case, by DAG files. When a DAG is removed, it disappears from the UI; the database record is still there, but it's inaccessible to the user. DagRuns and DatasetEvents, on the other hand, reflect the history of operations on those objects. The way Airflow manages them reflects that distinction.

So, we need to actually define what the database model reflects. I would personally define it as a marker that some DAG is scheduled on. I would even argue that it's useless to have a database record for purely `outlet` datasets, i.e. output datasets that no DAG is scheduled on - or, at the least, similarly to "inactive" DAGs, to have a flag that indicates whether anything is scheduled on that dataset. It would unlock a really nice property: any dataset in the table represents something schedulable and worth creating DatasetEvents for. I think it would also simplify Eduardo's original proposal, since you would not care whether the Dataset object exists or not - you just send the event, and the Airflow instance determines whether anything is scheduled on it. The act of creating a dataset database entity is superfluous in that context.

> I meant that Connections are also "useless" if no task is using that Connection - but we allow them to be created independently of dags.

I think there are a few differences, but the most important one is about state. Connections don't need to reflect any underlying state; at most they can be initialized by things like env variables. If Airflow forced users to use a secrets backend, it would be similar - but then I doubt Airflow would support creating them from the UI. Dataset URIs can mutate, and managing that state while also allowing external changes can be a nightmare.

Another difference is a chicken-and-egg one: if you don't allow Connections to be defined before any DAG is scheduled to use them, you practically force users to create DAGs that can't successfully run.
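To make the two models concrete: the declarative side below is ordinary Airflow 2.x DAG-file syntax, while the imperative call is only a rough sketch of the kind of endpoint Eduardo's proposal is about - the URL, payload and auth there are hypothetical, not an existing API:

    # Declarative: the dataset "exists" because DAG files reference it.
    from airflow.datasets import Dataset
    from airflow.decorators import dag, task
    import pendulum

    orders = Dataset("s3://warehouse/orders.parquet")

    @dag(schedule=[orders], start_date=pendulum.datetime(2024, 1, 1))
    def consume_orders():
        @task
        def consume():
            ...

        consume()

    consume_orders()

    # Imperative (sketch only - hypothetical endpoint and payload): an
    # external producer just reports "this URI changed", and Airflow
    # decides whether anything is scheduled on it.
    import requests

    requests.post(
        "https://airflow.example.com/api/v1/datasets/events",
        json={"dataset_uri": "s3://warehouse/orders.parquet"},
        auth=("user", "pass"),
    )

In that world the producer never needs to know whether the Dataset row exists - which is exactly why the row itself only matters as a scheduling marker.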
On Mon, Jan 29, 2024 at 18:51 Constance Martineau <consta...@astronomer.io.invalid> wrote:

> I've had a few conversations with Astronomer customers within the past few days who are looking for an approved way to create datasets outside of the dag parsing process. They are already - or are considering - using some sort of custom process similar to what Steve suggested in the github discussion <https://github.com/apache/airflow/discussions/36723#discussioncomment-8243269>.
>
> Given those conversations and the feedback from PRs, Github Discussions, and this dev thread, I appreciate that there's a need that Airflow isn't filling today. To gather more support, we need a proper answer about how we will deal with clashes between the imperative and declarative approaches. As a Product Manager I do not have the skillset to figure this out on my own, but I would be happy to work with someone in the community on this.
>
> On Thu, Jan 25, 2024 at 9:58 AM Eduardo Nicastro <edu.nicas...@gmail.com> wrote:
>
> > Thanks, Potiuk, for highlighting the importance of aligning new features with Airflow's roadmap. I agree we need to be cautious about expanding dataset functionalities in ways that might conflict with existing or planned features. However, this approach doesn't necessarily transform Airflow into a 'dataset metadata storage' but rather enhances its role as a centralized orchestrator, making datasets more visible and manageable.
> >
> > Tornike G., you raise a valid concern about mixing declarative and imperative approaches. We need to think carefully about how API-created datasets would coexist with those defined in DAG files. However, in my opinion, this is a natural transition that will likely become necessary as Airflow is used in increasingly diverse environments and organizations - a shift that seems inevitable.
> >
> > Constance M., your perspective on enabling API/UI management for datasets is spot-on. It adds a layer of flexibility and visibility that's crucial for modern data orchestration, aligning well with Airflow's goals of being a comprehensive workflow platform without overstepping its primary functions.
> >
> > To add my perspective, echoing some of what I posted in the GH discussion (https://github.com/apache/airflow/discussions/36723): data-aware scheduling was a transformative step for Airflow because it acknowledged data as the primary workflow trigger. This proposal is essentially an extension of that concept, further decoupling Airflow from the assumption that only DAGs can influence datasets. I also believe it aligns with modern data engineering practices, where workflows are increasingly driven by data events, and I think this is particularly interesting for larger organizations, where datasets frequently span various systems and teams.
> >
> > On Wed, Jan 24, 2024 at 8:53 PM Tornike Gurgenidze <togur...@freeuni.edu.ge> wrote:
> >
> > > What I meant by update/delete operations was referring to Dataset objects themselves, not DatasetEvents. I also see no issue in allowing dataset changes to be registered externally. I admit that deleting datasets is probably irrelevant, as even now they are not deleted but instead orphaned after reference counting, but the U in CRUD is still very much relevant imho. There's a field called extra in DatasetModel, for example, which has no use inside airflow, but it still might be used from user code in all sorts of ways.
> > >
> > > I'm not saying it's impossible for these interfaces to coexist if you isolate them from one another, especially when multiple dag-processors already do something similar for dags even now (isolating sets of objects from one another using the processor_subdir value) - it just feels unnatural to have a declarative (dag code) interface and an imperative (API/UI) interface for interacting with one type of object.
> > >
> > > On Wed, Jan 24, 2024 at 11:35 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:
> > >
> > > > You're right. I didn't mean to say that Connections and Datasets facilitate the same thing - they don't. I meant that Connections are also "useless" if no task is using that Connection - but we allow them to be created independently of dags. From that angle, I don't see how allowing Datasets to be created independently is any different.
> > > >
> > > > Also happy to hear from others about this.
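As a side note on the `extra` field Tornike mentions above: today it can only be populated declaratively, when the dataset is declared in DAG code, roughly like this (standard Airflow 2.x syntax; the URI and metadata are made up for illustration):

    from airflow.datasets import Dataset

    # `extra` is free-form metadata stored on the DatasetModel row.
    # Airflow core does not interpret it, but user code (listeners,
    # plugins, etc.) can read it in arbitrary ways - which is why the
    # "U" in CRUD stays relevant even if "D" does not.
    orders = Dataset(
        uri="s3://warehouse/orders.parquet",
        extra={"owner": "data-platform", "format": "parquet"},
    )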
> > > > On Wed, Jan 24, 2024 at 1:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > > >
> > > > > I'd love to hear what others think - especially those who are involved in dataset creation and discussion more than me. I personally believe that conceptually connections and datasets are as far from each other as possible (I have no idea where the perceived similarity between connections - which are essentially static configuration of credentials - and datasets - which are a dynamic reflection of data being passed live between tasks - comes from). The only similarity I see is that they are both stored by Airflow in some table (and even not that if you use SecretsManager). So comparing those two is an apples-to-pears comparison if you ask me.
> > > > >
> > > > > But (despite my 4 years of experience working on Airflow) my actual experience with Datasets is limited - I've mainly been observing what was going on - so I would love to hear from those who created (and continue to think about the future of) the datasets :).
> > > > >
> > > > > J,
> > > > >
> > > > > On Wed, Jan 24, 2024 at 7:27 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:
> > > > >
> > > > > > Right. That is why I was trying to make a distinction in the PR and in this discussion between CRUD-ing Dataset Objects/Definitions vs creating and deleting Dataset Events from the queue. Happy to standardize on whatever terminology makes sure things are understood and lets us have a productive conversation.
> > > > > >
> > > > > > For Dataset Events - creating, reading and deleting them via API is IMHO not controversial.
> > > > > > - For creating: This has been discussed in various places; the endpoint could be used to trigger dependent dags.
> > > > > > - For deleting: It is easy for DAGs with multiple upstream dependencies to go out of sync, and there is no way to recover from that without manipulating the DB directly. See here <https://github.com/apache/airflow/discussions/36618> and here <https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>.
> > > > > >
> > > > > > For CRUD-ing Dataset Definitions via API:
> > > > > >
> > > > > > > IMHO Airflow should only manage its own entities and at most it should emit events (dataset listeners, openlineage API) to inform others about state changes of things that Airflow manages, but it should not be abused to store "other" datasets that Airflow DAGs know nothing about.
> > > > > >
> > > > > > I disagree that it is an abuse. If I as an internal data producer publish a dataset that I expect internal Airflow users to use, it is not abusing Airflow to create a dataset and make it visible in Airflow. At some point in the near future, users will start referencing them in their dags - it's just a sequencing question. We don't enforce connections being tied to a dag, and conceptually this is no different. It is also no different than adding the definition as part of a dag file and having that dataset show up in the dataset list, without forcing it to be a task output as part of a dag. The only valid reason to not allow it IMHO is that they were designed to be defined within a dag file, similar to a dag, and we don't want to deal with the impediment I laid out.
> > > > > >
> > > > > > On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > > > > >
> > > > > > > On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau <consta...@astronomer.io.invalid> wrote:
> > > > > > >
> > > > > > > > I also think it makes sense to allow people to create/update/delete Datasets via the API and eventually the UI. Even if the dataset is not initially connected to a DAG, it's nice to be able to see in one place all the datasets and ML models that my dags can leverage. We allow people to create Connections and Variables via the API and UI without forcing users to use them as part of a task or dag. This isn't any different from that aspect.
> > > > > > > >
> > > > > > > > > Airflow has some objects that can be created by a dag processor (Dags, Datasets) and others that can be created with API/UI (Connections, Variables)
> > > > > > >
> > > > > > > A comment from my side. I think there is a big conceptual difference here that you yourself noticed - DAG code, via the DagProcessor, creates DAGs and Datasets, while the UI/API can be used to create and modify Connections/Variables that are then USED (but never created) by DAG code. This is why, while I see no fundamental security blocker with "Creating" Datasets via API, it definitely feels out-of-place to be able to manage them via API.
> > > > > > >
> > > > > > > And following the discussion from the PR - yes, we should discuss create, update and delete differently, because conceptually they are NOT typical CRUD (which the Connections/Variables API and UI is). I think there is a huge difference between "Updating" and "Deleting" datasets via the API and the `UD` in CRUD:
> > > > > > >
> > > > > > > * Updating a dataset does not actually "update" its definition; it informs those who listen on the dataset that it has changed. No more, no less. Typically when you have a CRUD operation, you pass the same data in "C" and "U" - but in our case those two operations are different and serve different purposes.
> > > > > > > * Deleting a dataset is also not what the "D" in CRUD is - in this case it is mostly "retention". And there are some very specific things here. Should we delete a dataset that some of the DAGs still have as input/output? IMHO - absolutely not. But... how do we know that?
> > > > > > > If we only have DAGs implicitly creating Datasets by declaring where they are used, we can easily know that by reference counting. But when we allow the creation of datasets via the API, it's no longer that obvious, and the number of cases to handle gets really big.
> > > > > > >
> > > > > > > After seeing the comments and discussion, I believe it's not a good idea to allow external Dataset creation; the use case does not justify it IMHO.
> > > > > > >
> > > > > > > Why?
> > > > > > >
> > > > > > > We do not want Airflow to become a "dataset metadata storage" that you can query/update to find out what kinds of datasets the whole <data lake> of yours has - this is not the purpose of Airflow and never will be, IMHO. It's a non-goal for Airflow to keep "other" datasets.
> > > > > > >
> > > > > > > IMHO Airflow should only manage its own entities, and at most it should emit events (dataset listeners, openlineage API) to inform others about state changes of things that Airflow manages, but it should not be abused to store "other" datasets that Airflow DAGs know nothing about. This in a way contradicts our "Airflow as a Platform" approach and the whole concept of the OpenLineage integration in Airflow. If you want to have a single place where you store all the datasets you manage, have all your components emit OpenLineage events and use a dedicated solution (Marquez, Amundsen, Google Data Catalog, etc.) - all the serious ones now consume OpenLineage events that pretty much all serious components already emit - and there you can have it all. This is our strategic direction, and this is why we accepted AIP-53 OpenLineage: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow. At the moment we accepted it, we also accepted the fact that Airflow is just a producer of lineage data, not a storage nor a consumer of it - because this is the scope of AIP-53.
> > > > > > >
> > > > > > > I think the only way a dataset should be created in the Airflow DB is via the DagFileProcessor - eventually with reference counting, and possibly removal of datasets that are no longer used by anyone, if we decide we do not want to keep old datasets in the DB. That should be it IMHO.
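PS: On the reference-counting point above - that pass is only simple as long as DAG files are the sole writers. A simplified illustration (not Airflow's actual implementation) of why API-created datasets complicate it: their reference count is zero from day one, so they look exactly like orphans.

    # Illustrative sketch of dataset orphaning via reference counting:
    # a dataset is orphaned once no DAG references it as a schedule
    # input or as a task outlet.
    def orphaned_uris(known_uris, schedule_refs, outlet_refs):
        referenced = set(schedule_refs) | set(outlet_refs)
        return [uri for uri in known_uris if uri not in referenced]

    # "s3://api/created" was registered externally and is referenced by
    # no DAG, so reference counting alone would flag it for removal.
    print(orphaned_uris(
        ["s3://raw/events", "s3://api/created"],
        schedule_refs=["s3://raw/events"],
        outlet_refs=[],
    ))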