What I meant by update/delete operations was referring to Dataset objects
themselves, not DatasetEvents. I also see no issue in allowing dataset
changes to be registered externally. I admit that deleting datasets is
probably irrelevant - even now they are not deleted, but orphaned after
reference counting - but the U in CRUD is still very much relevant IMHO.
There's a field called extra in DatasetModel, for example, which has no use
inside Airflow itself, but it might still be used from user code in all
sorts of ways.
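To illustrate the point about extra, here is a minimal sketch. The Dataset
class below is a hypothetical stand-in, not Airflow's actual implementation;
it only shows the shape of the field - an arbitrary JSON-serializable dict
that Airflow stores but never interprets itself:

```python
from dataclasses import dataclass, field

# Illustrative stand-in for Airflow's Dataset/DatasetModel, just to show
# how "extra" behaves: opaque metadata that Airflow carries around but
# never reads. This is NOT Airflow's real class.
@dataclass(frozen=True)
class Dataset:
    uri: str
    extra: dict = field(default_factory=dict)

# User code is free to stash anything here - ownership info, schema hints,
# retention policy - and read it back later from tasks or plugins.
orders = Dataset(
    uri="s3://warehouse/orders",
    extra={"owner": "data-platform", "schema_version": 3},
)
print(orders.extra["owner"])  # data-platform
```

An API-driven "U" would then amount to replacing or merging that dict -
exactly the kind of update that Airflow itself never performs, but user
code might rely on.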

I'm not saying it's impossible for these interfaces to coexist if you
isolate them from one another, especially since multiple dag-processors
already do something similar for dags even now (isolating sets of objects
from one another using the processor_subdir value). It just feels unnatural
to have both a declarative (dag code) and an imperative (API/UI) interface
for interacting with one type of object.
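For context, the orphaning-by-reference-counting behaviour mentioned above
can be sketched roughly like this (class and method names are illustrative,
not Airflow's actual code):

```python
from collections import defaultdict

# Rough sketch of the reference-counting idea: each dag contributes
# references to the dataset URIs it declares, and a dataset whose last
# reference disappears is marked orphaned rather than deleted.
class DatasetRegistry:
    def __init__(self):
        self.refs = defaultdict(set)   # uri -> dag_ids referencing it
        self.orphaned = set()

    def register_dag(self, dag_id, dataset_uris):
        for uri in dataset_uris:
            self.refs[uri].add(dag_id)
            self.orphaned.discard(uri)  # referenced again: un-orphan

    def remove_dag(self, dag_id):
        for uri, dags in self.refs.items():
            dags.discard(dag_id)
            if not dags:                # no dag references it any more:
                self.orphaned.add(uri)  # orphan it, don't delete it

registry = DatasetRegistry()
registry.register_dag("ingest", ["s3://raw/orders"])
registry.register_dag("report", ["s3://raw/orders", "s3://mart/daily"])
registry.remove_dag("report")
print(registry.orphaned)  # {'s3://mart/daily'}
```

Once datasets can also be created imperatively via an API, a count like
this no longer tells the whole story - an externally created dataset starts
at zero dag references - which is part of why the delete semantics get
murky.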

On Wed, Jan 24, 2024 at 11:35 PM Constance Martineau
<consta...@astronomer.io.invalid> wrote:

> You're right. I didn't mean to say that the Connections and Datasets
> facilitate the same thing - they don't. I meant that Connections are also
> "useless" if no task is using that Connection - but we allow them to be
> created independently of dags. From that angle - I don't see how allowing
> Datasets to be created independently is any different.
>
> Also happy to hear from others about this.
>
> On Wed, Jan 24, 2024 at 1:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > I'd love to hear what others think - especially those who are involved
> > in dataset creation and discussion more than me. I personally believe
> > that conceptually connections and datasets are as far from each other
> > as possible. I have no idea where the perceived similarity between
> > connections (which are essentially static configuration of credentials)
> > and datasets (which are a dynamic reflection of data being passed live
> > between tasks) comes from. The only similarity I see is that they are
> > both stored by Airflow in some table (and not even that if you use a
> > Secrets Manager). So comparing those two is an apples-to-pears
> > comparison if you ask me.
> >
> > But (despite my 4 years of experience creating Airflow) my actual
> > experience with Datasets is limited - I've mainly been observing what
> > was going on - so I would love to hear from those who created (and
> > continue to think about the future of) the datasets :).
> >
> > J,
> >
> > On Wed, Jan 24, 2024 at 7:27 PM Constance Martineau
> > <consta...@astronomer.io.invalid> wrote:
> >
> > > Right. That is why I was trying to make a distinction in the PR and
> > > in this discussion between CRUD-ing Dataset Objects/Definitions vs
> > > creating and deleting Dataset Events from the queue. Happy to
> > > standardize on whatever terminology makes sure things are understood
> > > and we can have a productive conversation.
> > >
> > > For Dataset Events - creating, reading and deleting them via the API
> > > is IMHO not controversial.
> > > - For creating: This has been discussed in various places, and the
> > > endpoint could be used to trigger dependent dags
> > > - For deleting: It is easy for DAGs with multiple upstream
> > > dependencies to go out of sync, and there is no way to recover from
> > > that without manipulating the DB directly. See here
> > > <https://github.com/apache/airflow/discussions/36618> and here
> > > <https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>
> > >
> > > For CRUD-ing Dataset Definitions via API:
> > >
> > > > IMHO Airflow should only manage its own entities and at most it
> > > > should emit events (dataset listeners, openlineage API) to inform
> > > > others about state changes of things that Airflow manages, but it
> > > > should not be abused to store "other" datasets that Airflow DAGs
> > > > know nothing about.
> > >
> > >
> > > I disagree that it is an abuse. If I, as an internal data producer,
> > > publish a dataset that I expect internal Airflow users to use, it is
> > > not abusing Airflow to create a dataset and make it visible in
> > > Airflow. At some point in the near future, users will start
> > > referencing them in their dags - it's just a sequencing question. We
> > > don't enforce connections being tied to a dag - and conceptually -
> > > this is no different. It is also no different than adding the
> > > definition as part of a dag file and having that dataset show up in
> > > the dataset list, without forcing it to be a task output as part of a
> > > dag. The only valid reason to not allow it IMHO is because they were
> > > designed to be defined within a dag file, similar to a dag, and we
> > > don't want to deal with the impediment I laid out.
> > >
> > > On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk <ja...@potiuk.com>
> wrote:
> > >
> > > > On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
> > > > <consta...@astronomer.io.invalid> wrote:
> > > >
> > > > > I also think it makes sense to allow people to
> > > > > create/update/delete Datasets via the API and eventually the UI.
> > > > > Even if the dataset is not initially connected to a DAG, it's
> > > > > nice to be able to see in one place all the datasets and ML
> > > > > models that my dags can leverage. We allow people to create
> > > > > Connections and Variables via the API and UI without forcing
> > > > > users to use them as part of a task or dag. This isn't any
> > > > > different from that aspect.
> > > > >
> > > > > > Airflow has some objects that can be created by a dag
> > > > > > processor (Dags, Datasets) and others that can be created with
> > > > > > the API/UI (Connections, Variables)
> > > > >
> > > > >
> > > > A comment from my side. I think there is a big conceptual
> > > > difference here that you yourself noticed - DAG code, via the
> > > > DagProcessor, creates DAGs and Datasets, while the UI/API can be
> > > > used to create and modify Connections/Variables that are then USED
> > > > (but never created) by DAG code. This is why, while I see no
> > > > fundamental security blocker with "creating" Datasets via the API,
> > > > it definitely feels out of place to be able to manage them via the
> > > > API.
> > > >
> > > > And following the discussion from the PR - yes, we should discuss
> > > > create, update and delete separately, because conceptually they are
> > > > NOT typical CRUD (which the Connection/Variables API and UI are).
> > > > I think there is a huge difference between "updating" and
> > > > "deleting" datasets via the API and the `UD` in CRUD:
> > > >
> > > > * Updating a dataset does not actually "update" its definition; it
> > > > informs those who listen on the dataset that it has changed. No
> > > > more, no less. Typically in a CRUD operation you pass the same data
> > > > in "C" and "U" - but in our case those two operations are different
> > > > and serve different purposes.
> > > > * Deleting a dataset is also not what "D" in CRUD is - in this case
> > > > it is mostly "retention". And there are some very specific things
> > > > here. Should we delete a dataset that some of the DAGs still have
> > > > as input/output? IMHO - absolutely not. But... how do we know that?
> > > > If only DAGs implicitly create Datasets by declaring where they are
> > > > used, we can easily know that by reference counting. But when we
> > > > allow the creation of datasets via the API, it's no longer that
> > > > obvious, and the number of cases to handle gets really big.
> > > >
> > > > After seeing the comments and discussion, I believe it's not a
> > > > good idea to allow external Dataset creation; the use case does not
> > > > justify it IMHO.
> > > >
> > > > Why?
> > > >
> > > > We do not want Airflow to become a "dataset metadata storage" that
> > > > you can query/update to find out what kinds of datasets your whole
> > > > <data lake> has - this is not the purpose of Airflow, and never
> > > > will be IMHO. It's a non-goal for Airflow to keep "other" datasets.
> > > >
> > > > IMHO Airflow should only manage its own entities and at most it
> > > > should emit events (dataset listeners, openlineage API) to inform
> > > > others about state changes of things that Airflow manages, but it
> > > > should not be abused to store "other" datasets that Airflow DAGs
> > > > know nothing about. This, in a way, contradicts our "Airflow as a
> > > > Platform" approach and the whole concept of Airflow's OpenLineage
> > > > integration. If you want a single place where all the datasets you
> > > > manage are stored, have all your components emit OpenLineage events
> > > > and use a dedicated solution (Marquez, Amundsen, Google Data
> > > > Catalog, etc.) - all of the serious ones now consume OpenLineage
> > > > events, which pretty much all serious components already emit - and
> > > > there you can have it all. This is our strategic direction, and
> > > > this is why we accepted AIP-53 OpenLineage:
> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > > > At the moment we accepted it, we also accepted the fact that
> > > > Airflow is just a producer of lineage data, not a storage nor a
> > > > consumer of it - because that is the scope of AIP-53.
> > > >
> > > > I think the only way a dataset should be created in the Airflow DB
> > > > is via the DagFileProcessor - with reference counting, and possibly
> > > > eventual removal of datasets that are no longer used by anyone, if
> > > > we decide we do not want to keep old datasets in the DB. That
> > > > should be it IMHO.
> > > >
> > > >
> > > >
> > > > >
> > > > > --
> > > > >
> > > > > Constance Martineau
> > > > > Senior Product Manager
> > > > >
> > > > > Email: consta...@astronomer.io
> > > > > Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
> > > > >
> > > > >
> > > > > <https://www.astronomer.io/>
> > > > >
> > > >
> > >
> > >
> >
>
>


-- 
With respect,
Tornike Gurgenidze,
third-year ESM student, group XI
