I've had a few conversations with Astronomer customers within the past few
days who are looking for an approved way to create datasets outside of the
dag parsing process. They are already using - or are considering - some
sort of custom process similar to what Steve suggested in the GitHub
discussion
<https://github.com/apache/airflow/discussions/36723#discussioncomment-8243269>.
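
For readers following along, a common shape for such a workaround is a small
"touch" DAG whose only job is to record a dataset update when triggered
externally. A minimal sketch (the uri below is a made-up example, and the
exact approach in the linked discussion may differ):

    import pendulum
    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    # Dataset produced outside Airflow; updating it here only records the
    # event so that dataset-scheduled DAGs downstream can fire.
    EXTERNAL_TABLE = Dataset("s3://warehouse/raw/orders")

    @dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
    def touch_external_dataset():
        @task(outlets=[EXTERNAL_TABLE])
        def register_update():
            # Succeeding with this outlet marks the dataset as updated.
            pass

        register_update()

    touch_external_dataset()

Triggering this DAG via the stable REST API (POST /api/v1/dags/{dag_id}/dagRuns)
then stands in for a "dataset updated" signal from an external producer.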


Given those conversations and the feedback from PRs, GitHub Discussions,
and this dev thread, I appreciate that there's a need that Airflow isn't
filling today. To gather more support, we need a proper answer about how we
will deal with clashes between the imperative and declarative approaches. As
a Product Manager, I do not have the skill set to figure this out on my own,
but I would be happy to work with someone in the community on this.

On Thu, Jan 25, 2024 at 9:58 AM Eduardo Nicastro <edu.nicas...@gmail.com>
wrote:

> Thanks, Potiuk, for highlighting the importance of aligning new features
> with Airflow's roadmap. I agree we need to be cautious about expanding
> dataset functionalities in ways that might conflict with existing or
> planned features. However, this approach doesn't necessarily transform
> Airflow into a 'dataset metadata storage' but rather enhances its role as a
> centralized orchestrator, making datasets more visible and manageable.
>
> Tornike G., you raise a valid concern about mixing declarative and
> imperative approaches. We need to think carefully about how API-created
> datasets would coexist with those defined in DAG files. However, in my
> opinion, this is a natural transition that will likely become necessary
> as Airflow is used in increasingly diverse environments and organizations.
>
> Constance M., your perspective on enabling API/UI management for datasets
> is spot-on. It adds a layer of flexibility and visibility that's crucial
> for modern data orchestration, aligning well with Airflow's goals of being
> a comprehensive workflow platform without overstepping its primary
> functions.
>
> To add my perspective, echoing some of what I posted in the GH discussion (
> https://github.com/apache/airflow/discussions/36723): Data-aware
> scheduling
> was a transformative step for Airflow because it acknowledged data as the
> primary workflow trigger. This proposal is essentially an extension of that
> concept, further decoupling Airflow from the assumption that only DAGs can
> influence datasets. I also believe it aligns with modern data
> engineering practices, where workflows are increasingly driven by data
> events, and I think this is particularly interesting for larger
> organizations where datasets frequently span various systems and teams.
>
>
> On Wed, Jan 24, 2024 at 8:53 PM Tornike Gurgenidze <togur...@freeuni.edu.ge>
> wrote:
>
> > What I meant by update/delete operations was referring to Dataset objects
> > themselves, not DatasetEvents. I also see no issue in allowing dataset
> > changes to be registered externally. I admit that deleting datasets is
> > probably irrelevant - even now they are not deleted but instead orphaned
> > after reference counting - but the U in CRUD is still very much relevant
> > IMHO. There's a field called extra in DatasetModel, for example, which has
> > no use inside Airflow itself but might still be used from user code in all
> > sorts of ways.
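> >
> > As an illustration of that last point, a minimal sketch (the keys in
> > extra are made-up examples, not anything Airflow itself interprets):
> >
> >     from airflow.datasets import Dataset
> >
> >     # extra is free-form metadata; Airflow stores it but assigns it no
> >     # meaning, so user code or external tooling can read it as it likes.
> >     orders = Dataset(
> >         "s3://warehouse/raw/orders",
> >         extra={"owner": "data-platform", "format": "parquet"},
> >     )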
> >
> > I'm not saying it's impossible for these interfaces to coexist if you
> > isolate them from one another - multiple dag-processors already do
> > something similar for dags even now (isolating sets of objects from one
> > another using the processor_subdir value) - it just feels unnatural to
> > have both declarative (dag code) and imperative (API/UI) interfaces for
> > interacting with one type of object.
> >
> > On Wed, Jan 24, 2024 at 11:35 PM Constance Martineau
> > <consta...@astronomer.io.invalid> wrote:
> >
> > > You're right. I didn't mean to say that Connections and Datasets
> > > facilitate the same thing - they don't. I meant that Connections are
> > > also "useless" if no task is using that Connection - but we allow them
> > > to be created independently of dags. From that angle, I don't see how
> > > allowing Datasets to be created independently is any different.
> > >
> > > Also happy to hear from others about this.
> > >
> > > On Wed, Jan 24, 2024 at 1:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > I'd love to hear what others think - especially those who are
> > > > involved in dataset creation and discussion more than me. I personally
> > > > believe that conceptually connections and datasets are as far from
> > > > each other as possible. I have no idea where the perceived similarity
> > > > between connections (which are essentially static configuration of
> > > > credentials) and datasets (which are a dynamic reflection of data
> > > > being passed live between tasks) comes from. The only similarity I see
> > > > is that they are both stored by Airflow in some table (and even not
> > > > that if you use SecretsManager). So comparing those two is an
> > > > apples-to-pears comparison if you ask me.
> > > >
> > > > But (despite my 4 years' experience of creating Airflow) my actual
> > > > experience with Datasets is limited - I've been mainly observing what
> > > > was going on - so I would love to hear from those who created (and
> > > > continue to think about the future of) the datasets :).
> > > >
> > > > J,
> > > >
> > > > On Wed, Jan 24, 2024 at 7:27 PM Constance Martineau
> > > > <consta...@astronomer.io.invalid> wrote:
> > > >
> > > > > Right. That is why I was trying to make a distinction in the PR and
> > > > > in this discussion between CRUD-ing Dataset Objects/Definitions vs
> > > > > creating and deleting Dataset Events from the queue. Happy to
> > > > > standardize on whatever terminology makes sure things are understood
> > > > > so we can have a productive conversation.
> > > > >
> > > > > For Dataset Events - creating, reading and deleting them via API is
> > > > > IMHO not controversial.
> > > > > - For creating: This has been discussed in various places, and the
> > > > > endpoint could be used to trigger dependent dags
> > > > > - For deleting: It is easy for DAGs with multiple upstream
> > > > > dependencies to go out of sync, and there is no way to recover from
> > > > > that without manipulating the DB directly. See here
> > > > > <https://github.com/apache/airflow/discussions/36618> and here
> > > > > <https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>
> > > > >
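> > > > > As a concrete illustration of the deletion problem: the queued
> > > > > events live in the dataset_dag_run_queue table, so "recovering"
> > > > > today means something like the following (a hedged sketch against
> > > > > 2.8-era internal models, with a made-up dag id; these models are not
> > > > > a public API and can change between versions):
> > > > >
> > > > >     from airflow.models.dataset import DatasetDagRunQueue
> > > > >     from airflow.utils.session import create_session
> > > > >
> > > > >     # Drop the queued dataset events for one consumer dag so its
> > > > >     # upstream dependencies can start from a clean slate again.
> > > > >     with create_session() as session:
> > > > >         session.query(DatasetDagRunQueue).filter(
> > > > >             DatasetDagRunQueue.target_dag_id == "my_consumer_dag",
> > > > >         ).delete(synchronize_session=False)
> > > > >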
> > > > > For CRUD-ing Dataset Definitions via API:
> > > > >
> > > > > > IMHO Airflow should only manage its own entities and at most
> > > > > > should emit events (dataset listeners, openlineage API) to inform
> > > > > > others about state changes of things that Airflow manages, but it
> > > > > > should not be abused to store "other" datasets, that Airflow DAGs
> > > > > > know nothing about.
> > > > >
> > > > >
> > > > > I disagree that it is an abuse. If I, as an internal data producer,
> > > > > publish a dataset that I expect internal Airflow users to use, it is
> > > > > not abusing Airflow to create a dataset and make it visible in
> > > > > Airflow. At some point in the near future, users will start
> > > > > referencing it in their dags - it's just a sequencing question. We
> > > > > don't enforce connections being tied to a dag - and conceptually,
> > > > > this is no different. It is also no different than adding the
> > > > > definition as part of a dag file and having that dataset show up in
> > > > > the dataset list, without forcing it to be a task output as part of
> > > > > a dag (see the sketch below). The only valid reason to not allow it
> > > > > IMHO is that datasets were designed to be defined within a dag file,
> > > > > similar to a dag, and we don't want to deal with the impediment I
> > > > > laid out.
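> > > > >
> > > > > For concreteness, a minimal sketch of that last pattern (the uri and
> > > > > dag_id are made-up examples): a dataset referenced only as a
> > > > > schedule shows up in the dataset list after parsing, even though no
> > > > > task anywhere produces it.
> > > > >
> > > > >     import pendulum
> > > > >     from airflow.datasets import Dataset
> > > > >     from airflow.models.dag import DAG
> > > > >     from airflow.operators.empty import EmptyOperator
> > > > >
> > > > >     # Produced by an external system; no task lists it as an outlet,
> > > > >     # yet it is registered during dag parsing.
> > > > >     external = Dataset("s3://warehouse/raw/orders")
> > > > >
> > > > >     with DAG(
> > > > >         dag_id="consume_external_dataset",
> > > > >         start_date=pendulum.datetime(2024, 1, 1),
> > > > >         schedule=[external],
> > > > >     ):
> > > > >         EmptyOperator(task_id="consume")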
> > > > >
> > > > > On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk <ja...@potiuk.com>
> > > > > wrote:
> > > > >
> > > > > > On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
> > > > > > <consta...@astronomer.io.invalid> wrote:
> > > > > >
> > > > > > > I also think it makes sense to allow people to
> > > > > > > create/update/delete Datasets via the API and eventually the UI.
> > > > > > > Even if the dataset is not initially connected to a DAG, it's
> > > > > > > nice to be able to see in one place all the datasets and ML
> > > > > > > models that my dags can leverage. We allow people to create
> > > > > > > Connections and Variables via the API and UI without forcing
> > > > > > > users to use them as part of a task or dag. This isn't any
> > > > > > > different from that aspect.
> > > > > > >
> > > > > > > > Airflow has some objects that can be created by a dag
> > > > > > > > processor (Dags, Datasets) and others that can be created
> > > > > > > > with API/UI (Connections, Variables)
> > > > > > >
> > > > > > >
> > > > > > A comment from my side. I think there is a big conceptual
> > > > > > difference here that you yourself noticed - DAG code, via the
> > > > > > DagProcessor, creates DAGs and Datasets, while the UI/API can allow
> > > > > > users to create and modify Connections/Variables that are then USED
> > > > > > (but never created) by DAG code. This is why, while I see no
> > > > > > fundamental security blocker with "creating" Datasets via the API,
> > > > > > it definitely feels out-of-place to be able to manage them via the
> > > > > > API.
> > > > > >
> > > > > > And following the discussion from the PR - yes, we should discuss
> > > > > > create, update and delete separately, because conceptually they
> > > > > > are NOT typical CRUD (which the Connections/Variables API and UI
> > > > > > are). I think there is a huge difference between "updating" and
> > > > > > "deleting" datasets via the API and the `UD` in CRUD:
> > > > > >
> > > > > > * Updating a dataset does not actually "update" its definition; it
> > > > > > informs those who listen on the dataset that it has changed. No
> > > > > > more, no less. Typically when you have a CRUD operation, you pass
> > > > > > the same data in "C" and "U" - but in our case those two operations
> > > > > > are different and serve different purposes.
> > > > > > * Deleting a dataset is also not what the "D" in CRUD is - in this
> > > > > > case it is mostly "retention". And there are some very specific
> > > > > > things here. Should we delete a dataset that some of the DAGs still
> > > > > > have as input/output? IMHO - absolutely not. But... how do we know
> > > > > > that? If we have only DAGs implicitly creating Datasets by
> > > > > > declaring whether they are used or not, we can easily know that by
> > > > > > reference counting (see the sketch below). But when we allow the
> > > > > > creation of datasets via the API, it's no longer that obvious and
> > > > > > the number of cases to handle gets really big.
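> > > > > >
> > > > > > To make the reference-counting point concrete, a hedged sketch of
> > > > > > what "is this dataset still referenced?" looks like against the
> > > > > > 2.8-era models (internal models, subject to change):
> > > > > >
> > > > > >     from airflow.models.dataset import (
> > > > > >         DagScheduleDatasetReference,
> > > > > >         DatasetModel,
> > > > > >         TaskOutletDatasetReference,
> > > > > >     )
> > > > > >     from airflow.utils.session import create_session
> > > > > >
> > > > > >     def is_referenced(uri: str) -> bool:
> > > > > >         """True if any dag consumes, or any task produces, the dataset."""
> > > > > >         with create_session() as session:
> > > > > >             ds = session.query(DatasetModel).filter_by(uri=uri).one_or_none()
> > > > > >             if ds is None:
> > > > > >                 return False
> > > > > >             consumers = (
> > > > > >                 session.query(DagScheduleDatasetReference)
> > > > > >                 .filter_by(dataset_id=ds.id)
> > > > > >                 .count()
> > > > > >             )
> > > > > >             producers = (
> > > > > >                 session.query(TaskOutletDatasetReference)
> > > > > >                 .filter_by(dataset_id=ds.id)
> > > > > >                 .count()
> > > > > >             )
> > > > > >             return (consumers + producers) > 0
> > > > > >
> > > > > > With API-created datasets there is no such reference to count,
> > > > > > which is exactly the ambiguity described above.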
> > > > > >
> > > > > > After seeing the comments and discussion, I believe it's not a
> > > > > > good idea to allow external Dataset creation; the use case does
> > > > > > not justify it IMHO.
> > > > > >
> > > > > > Why?
> > > > > >
> > > > > > We do not want Airflow to become a "dataset metadata storage" that
> > > > > > you can query/update to find out what kinds of datasets your whole
> > > > > > <data lake> has - this is not the purpose of Airflow, and never
> > > > > > will be IMHO. It's a non-goal for Airflow to keep "other" datasets.
> > > > > >
> > > > > > IMHO Airflow should only manage its own entities and at most
> > > > > > should emit events (dataset listeners, the OpenLineage API) to
> > > > > > inform others about state changes of things that Airflow manages;
> > > > > > it should not be abused to store "other" datasets that Airflow
> > > > > > DAGs know nothing about. That would, in a way, contradict our
> > > > > > "Airflow as a Platform" approach and the whole concept of
> > > > > > Airflow's OpenLineage integration. If you want a single place
> > > > > > where all the datasets you manage are stored, have all your
> > > > > > components emit OpenLineage events and use a dedicated solution
> > > > > > (Marquez, Amundsen, Google Data Catalog, etc.) - all of the
> > > > > > serious ones now consume OpenLineage events that pretty much all
> > > > > > serious components already emit - and there you can have it all.
> > > > > > This is our strategic direction, and this is why we accepted
> > > > > > AIP-53 OpenLineage:
> > > > > > <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow>.
> > > > > > At the moment we accepted it, we also accepted the fact that
> > > > > > Airflow is just a producer of lineage data, not a storage nor a
> > > > > > consumer of it - because this is the scope of AIP-53.
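> > > > > >
> > > > > > For reference, the "emit events" part already has a pluggable
> > > > > > shape. A hedged sketch of a dataset listener (written against the
> > > > > > 2.8-era listener hooks as I understand them; the print sink is
> > > > > > just a placeholder for a real forwarder):
> > > > > >
> > > > > >     # plugins/dataset_listener.py
> > > > > >     from airflow.datasets import Dataset
> > > > > >     from airflow.listeners import hookimpl
> > > > > >     from airflow.plugins_manager import AirflowPlugin
> > > > > >
> > > > > >     class DatasetForwarder:
> > > > > >         @hookimpl
> > > > > >         def on_dataset_changed(self, dataset: Dataset):
> > > > > >             # Forward the state change to an external catalog here.
> > > > > >             print(f"dataset changed: {dataset.uri}")
> > > > > >
> > > > > >     class DatasetListenerPlugin(AirflowPlugin):
> > > > > >         name = "dataset_listener_plugin"
> > > > > >         listeners = [DatasetForwarder()]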
> > > > > >
> > > > > > I think the only way a dataset should be created in the Airflow DB
> > > > > > is via the DagFileProcessor - eventually with reference counting,
> > > > > > and possibly removal of datasets that are no longer used by anyone,
> > > > > > if we decide we do not want to keep old datasets in the DB. That
> > > > > > should be it IMHO.
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Constance Martineau
> > > Senior Product Manager
> > >
> > > Email: consta...@astronomer.io
> > > Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
> > >
> > >
> > > <https://www.astronomer.io/>
> > >
> >
> >
> > --
> > Respectfully,
> > Tornike Gurgenidze
> > Third-year ESM student, Group XI
> >
>
