Thanks, Potiuk, for highlighting the importance of aligning new features
with Airflow's roadmap. I agree we need to be cautious about expanding
dataset functionalities in ways that might conflict with existing or
planned features. However, this approach doesn't necessarily transform
Airflow into a 'dataset metadata storage' but rather enhances its role as a
centralized orchestrator, making datasets more visible and manageable.

Tornike G., you raise a valid concern about mixing declarative and
imperative approaches. We need to think carefully about how API-created
datasets would coexist with those defined in DAG files. In my opinion,
though, this is a natural transition that will become necessary as
Airflow is used in increasingly diverse environments and organizations.

Constance M., your perspective on enabling API/UI management for datasets
is spot-on. It adds a layer of flexibility and visibility that's crucial
for modern data orchestration, aligning well with Airflow's goals of being
a comprehensive workflow platform without overstepping its primary
functions.

To add my perspective, echoing some of what I posted in the GH discussion (
https://github.com/apache/airflow/discussions/36723): Data-aware scheduling
was a transformative step for Airflow because it acknowledged data as the
primary workflow trigger. This proposal is essentially an extension of that
concept, further decoupling Airflow from the assumption that only DAGs can
influence datasets. I also believe it aligns with modern data
engineering practices, where workflows are increasingly driven by data
events, and I think this is particularly interesting for larger
organizations where datasets frequently span multiple systems and teams.
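To make "data as the primary workflow trigger" concrete, here is a
deliberately simplified in-memory sketch (plain Python with invented
names, not Airflow internals): consumers subscribe to a dataset URI, and
an event on that URI - whether emitted by a task outlet or, under this
proposal, by an external API client - triggers them:

```python
from collections import defaultdict

# Toy model of data-aware scheduling. In real Airflow, task "outlets"
# emit DatasetEvents that the scheduler matches against DAGs scheduled
# on those datasets; this registry only mimics that idea.
class DatasetRegistry:
    def __init__(self):
        self.subscribers = defaultdict(list)  # dataset URI -> consumer DAG ids
        self.triggered = []                   # (dag_id, source) pairs that fired

    def subscribe(self, dag_id, uri):
        """A consumer DAG declares it is scheduled on this dataset."""
        self.subscribers[uri].append(dag_id)

    def post_event(self, uri, source="task"):
        """A dataset event arrives - today from a task outlet; under the
        proposal discussed here, possibly from an external API call."""
        for dag_id in self.subscribers[uri]:
            self.triggered.append((dag_id, source))

registry = DatasetRegistry()
registry.subscribe("train_model", "s3://warehouse/features.parquet")

# Today: only a DAG task updates the dataset...
registry.post_event("s3://warehouse/features.parquet", source="task")
# ...under the proposal, an external producer could do the same via API:
registry.post_event("s3://warehouse/features.parquet", source="api")

print(registry.triggered)
# [('train_model', 'task'), ('train_model', 'api')]
```

The point of the sketch is that the consumer is indifferent to where the
event came from - which is exactly why the proposal feels like a natural
extension of data-aware scheduling rather than a new paradigm.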


On Wed, Jan 24, 2024 at 8:53 PM Tornike Gurgenidze <togur...@freeuni.edu.ge>
wrote:

> What I meant by update/delete operations was referring to Dataset objects
> themselves, not DatasetEvents. I also see no issue in allowing dataset
> changes to be registered externally. I admit that deleting datasets is
> probably irrelevant - even now they are not deleted, but orphaned after
> reference counting - but the U in CRUD is still very much relevant imho.
> There's a field called extra in DatasetModel, for example, which has no
> use inside airflow but might still be used from user code in all sorts
> of ways.
>
> I'm not saying it's impossible for these interfaces to coexist if you
> isolate them from one another, especially when multiple dag-processors
> already do something similar for dags even now (isolating sets of
> objects from one another using the processor_subdir value); it just
> feels unnatural to have declarative (dag code) and imperative (API/UI)
> interfaces for interacting with one type of object.
>
> On Wed, Jan 24, 2024 at 11:35 PM Constance Martineau
> <consta...@astronomer.io.invalid> wrote:
>
> > You're right. I didn't mean to say that the Connections and Datasets
> > facilitate the same thing - they don't. I meant that Connections are also
> > "useless" if no task is using that Connection - but we allow them to be
> > created independently of dags. From that angle - I don't see how allowing
> > Datasets to be created independently is any different.
> >
> > Also happy to hear from others about this.
> >
> > On Wed, Jan 24, 2024 at 1:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > > I'd love to hear what others think - especially those who are
> > > involved in dataset creation and discussion more than me. I
> > > personally believe that conceptually connections and datasets are as
> > > far from each other as possible (I have no idea where the similarity
> > > between connections - which are essentially static configuration of
> > > credentials - and datasets - which are a dynamic reflection of data
> > > being passed live between tasks - comes from). The only similarity I
> > > see is that they are both stored by Airflow in some table (and even
> > > not that if you use SecretsManager). So comparing those two is an
> > > apples-to-pears comparison if you ask me.
> > >
> > > But (despite my 4 years of experience creating Airflow) my actual
> > > experience with Datasets is limited - I've been mainly observing what
> > > was going on - so I would love to hear from those who created (and
> > > continue to think about the future of) the datasets :).
> > >
> > > J,
> > >
> > > On Wed, Jan 24, 2024 at 7:27 PM Constance Martineau
> > > <consta...@astronomer.io.invalid> wrote:
> > >
> > > > Right. That is why I was trying to make a distinction in the PR
> > > > and in this discussion between CRUD-ing Dataset
> > > > Objects/Definitions vs creating and deleting Dataset Events from
> > > > the queue. Happy to standardize on whatever terminology to make
> > > > sure things are understood and we can have a productive
> > > > conversation.
> > > >
> > > > For Dataset Events - creating, reading and deleting them via API
> > > > is IMHO not controversial.
> > > > - For creating: This has been discussed in various places, and the
> > > > endpoint could be used to trigger dependent dags
> > > > - For deleting: It is easy for DAGs with multiple upstream
> > > > dependencies to go out of sync, and there is no way to recover
> > > > from that without manipulating the DB directly. See here
> > > > <https://github.com/apache/airflow/discussions/36618> and here
> > > > <https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>
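The "out of sync" failure mode above is easy to model. A sketch (plain
Python with invented names - a toy model, not the actual scheduler
logic) of a consumer scheduled on two upstream datasets, where a stale
queued event keeps the queue half-filled until it can be deleted:

```python
# Toy model of the queued-dataset-event problem: a DAG scheduled on
# several datasets only runs once ALL of them have a queued event,
# so one stale or missing event leaves it stuck.
class ConsumerDag:
    def __init__(self, dag_id, upstream_uris):
        self.dag_id = dag_id
        self.upstream_uris = set(upstream_uris)
        self.queued = set()   # URIs with an unconsumed event
        self.runs = 0

    def receive_event(self, uri):
        if uri in self.upstream_uris:
            self.queued.add(uri)
        if self.queued == self.upstream_uris:  # all upstreams ready
            self.runs += 1
            self.queued.clear()                # events are consumed

    def delete_queued_event(self, uri):
        """The recovery a delete endpoint would enable, instead of
        manipulating the DB directly."""
        self.queued.discard(uri)

dag = ConsumerDag("report", ["s3://a", "s3://b"])
dag.receive_event("s3://a")         # only one upstream fired: no run
dag.receive_event("s3://a")         # re-fired upstream: still no run
print(dag.runs, sorted(dag.queued))  # 0 ['s3://a']
dag.delete_queued_event("s3://a")    # operator clears the stale event
dag.receive_event("s3://a")
dag.receive_event("s3://b")          # now both queued -> the DAG runs
print(dag.runs)                      # 1
```

Without the delete operation, the only way out of the stuck state in
this model is editing the queue storage by hand - which mirrors the
linked discussions.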
> > > >
> > > > For CRUD-ing Dataset Definitions via API:
> > > >
> > > > > IMHO Airflow should only manage its own entities and at most it
> > > > > should emit events (dataset listeners, openlineage API) to
> > > > > inform others about state changes of things that Airflow
> > > > > manages, but it should not be abused to store "other" datasets
> > > > > that Airflow DAGs know nothing about.
> > > >
> > > > I disagree that it is an abuse. If I, as an internal data
> > > > producer, publish a dataset that I expect internal Airflow users
> > > > to use, it is not abusing Airflow to create that dataset and make
> > > > it visible in Airflow. At some point in the near future, users
> > > > will start referencing it in their dags - it's just a sequencing
> > > > question. We don't enforce connections being tied to a dag, and
> > > > conceptually this is no different. It is also no different from
> > > > adding the definition as part of a dag file and having that
> > > > dataset show up in the dataset list, without forcing it to be a
> > > > task output as part of a dag. The only valid reason to not allow
> > > > it IMHO is that datasets were designed to be defined within a dag
> > > > file, similar to a dag, and we don't want to deal with the
> > > > impediment I laid out.
> > > >
> > > > On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk <ja...@potiuk.com>
> > wrote:
> > > >
> > > > > On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
> > > > > <consta...@astronomer.io.invalid> wrote:
> > > > >
> > > > > > I also think it makes sense to allow people to
> > > > > > create/update/delete Datasets via the API and eventually UI.
> > > > > > Even if the dataset is not initially connected to a DAG, it's
> > > > > > nice to be able to see in one place all the datasets and ML
> > > > > > models that my dags can leverage. We allow people to create
> > > > > > Connections and Variables via the API and UI without forcing
> > > > > > users to use them as part of a task or dag. This isn't any
> > > > > > different from that aspect.
> > > > > >
> > > > > > > Airflow has some objects that can be created by a dag
> > > > > > > processor (Dags, Datasets) and others that can be created
> > > > > > > with API/UI (Connections, Variables)
> > > > > >
> > > > > >
> > > > > A comment from my side. I think there is a big conceptual
> > > > > difference here that you yourself noticed - DAG code, via the
> > > > > DagProcessor, creates DAGs and Datasets, while the UI/API can
> > > > > allow creating and modifying Connections/Variables that are then
> > > > > USED (but never created) by DAG code. This is why, while I see
> > > > > no fundamental security blocker with "creating" Datasets via
> > > > > API, it definitely feels out of place to be able to manage them
> > > > > via API.
> > > > >
> > > > > And following the discussion from the PR - yes, we should
> > > > > discuss create, update and delete separately, because
> > > > > conceptually they are NOT typical CRUD (which the
> > > > > Connection/Variable API and UI are). I think there is a huge
> > > > > difference between "updating" and "deleting" datasets via the
> > > > > API and the `UD` in CRUD:
> > > > >
> > > > > * Updating a dataset does not actually "update" its definition;
> > > > > it informs those who listen on the dataset that it has changed.
> > > > > No more, no less. Typically in a CRUD operation you pass the
> > > > > same data in "C" and "U" - but in our case those two operations
> > > > > are different and serve different purposes.
> > > > > * Deleting a dataset is also not what "D" in CRUD is - in this
> > > > > case it is mostly "retention". And there are some very specific
> > > > > things here. Should we delete a dataset that some of the DAGs
> > > > > still have as input/output? IMHO - absolutely not. But ... how
> > > > > do we know that? If we have only DAGs implicitly creating
> > > > > Datasets by declaring whether they are used or not, we can
> > > > > easily know that by reference counting. But when we allow
> > > > > creation of datasets via API, it's no longer that obvious and
> > > > > the number of cases to handle gets really big.
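The reference-counting point can be sketched in a few lines (plain
Python with invented names - a toy model, not Airflow's actual
DatasetModel handling). When every dataset is implied by DAG code,
orphaning is trivial; an API-created dataset has zero DAG references
from the start, so the same rule would discard it immediately:

```python
# Toy model: datasets referenced by DAG files vs created via API.
# With DAG-declared datasets only, "no references" == "safe to orphan".
class DatasetStore:
    def __init__(self):
        self.refcount = {}        # dataset URI -> number of referencing DAGs
        self.api_created = set()  # URIs registered via API, with no DAG yet

    def register_from_dag(self, uri):
        self.refcount[uri] = self.refcount.get(uri, 0) + 1

    def register_from_api(self, uri):
        self.refcount.setdefault(uri, 0)
        self.api_created.add(uri)

    def orphaned(self):
        """The simple rule that works when DAG code is the only creator."""
        return {uri for uri, n in self.refcount.items() if n == 0}

store = DatasetStore()
store.register_from_dag("s3://a")   # declared in a dag file
store.register_from_api("s3://b")   # created via the proposed API

# The naive rule would immediately treat the API-created dataset as
# orphaned, even though an external producer still considers it live:
print(store.orphaned())             # {'s3://b'}
```

Any real design would need an extra signal (an owner, a TTL, an
explicit delete) to distinguish "never referenced yet" from "no longer
referenced" - which is the explosion of cases being described.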
> > > > >
> > > > > After seeing the comments and discussion, I believe it's not a
> > > > > good idea to allow external Dataset creation; the use case does
> > > > > not justify it IMHO.
> > > > >
> > > > > Why?
> > > > >
> > > > > We do not want Airflow to become a "dataset metadata storage"
> > > > > that you can query/update to find out what datasets your whole
> > > > > <data lake> has - this is not the purpose of Airflow, and never
> > > > > will be IMHO. It's a non-goal for Airflow to keep "other"
> > > > > datasets.
> > > > >
> > > > > IMHO Airflow should only manage its own entities and at most it
> > > > > should emit events (dataset listeners, openlineage API) to
> > > > > inform others about state changes of things that Airflow
> > > > > manages, but it should not be abused to store "other" datasets
> > > > > that Airflow DAGs know nothing about. This, in a way,
> > > > > contradicts our "Airflow as a Platform" approach and the whole
> > > > > concept of Airflow's OpenLineage integration. If you want a
> > > > > single place to store all the datasets you manage, have all your
> > > > > components emit open-lineage events and use a dedicated solution
> > > > > (Marquez, Amundsen, Google Data Catalog, etc.) - all of the
> > > > > serious ones now consume OpenLineage events that pretty much all
> > > > > serious components already emit - and there you can have it all.
> > > > > This is our strategic direction, and this is why we accepted
> > > > > AIP-53 OpenLineage:
> > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > > > > At the moment we accepted it, we also accepted the fact that
> > > > > Airflow is just a producer of lineage data, not a storage for
> > > > > nor a consumer of it - because this is the scope of AIP-53.
> > > > >
> > > > > I think the only way a dataset should be created in the Airflow
> > > > > DB is via the DagFileProcessor - eventually with reference
> > > > > counting and removal of datasets that are no longer used by
> > > > > anyone, if we decide we do not want to keep old datasets in the
> > > > > DB. That should be it IMHO.
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Constance Martineau
> > > > > > Senior Product Manager
> > > > > >
> > > > > > Email: consta...@astronomer.io
> > > > > > Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
> > > > > >
> > > > > >
> > > > > > <https://www.astronomer.io/>
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>
>
> --
> Respectfully,
> Tornike Gurgenidze,
> Third-year ESM student, Group XI
>
