Thank you for starting this discussion.

At first I would say that databases should be indexed to achieve good
performances against standard queries / use cases.

The Rest API does not do any crazy things/querying (as I recall). Listing,
filtering, ordering and searching against our main tables should not bring
performance issues. I would qualify that as a 'normal' usage. (in
opposition to a special edge case for the rest API)

Also this would be beneficial beyond the web server because the CLI,
Internal RPC, maybe the scheduler or even BI tools directly plugged on the
metadata database would have better performances. (not mentioning the UI of
course which is quite important to our users)

The  API is being more and more advertised to your users as the 'official'
way to interact with the airflow database, if we want that to be adopted
widely and prevent people from accessing the db directly, having a
performant API will be a requirement.

If some endpoints are too specific and there are better ways of getting
that same information, deprecating and later on removing them all together
could be a solution. In case they would require too many specific indexes
to work properly compared to its alternate solution. (or take the
opportunity of airflow 3 and remove them directly)

>  Another argument would be "this should be the responsibility of the
cluster maintainer", since it's not part of the core operation of airflow

I do not know if this implies leaving the customisation up to the cluster
manager, but letting people interfere with the db schema scares me.
(through customisation or something) I think it should remain a private
implementation detail. But maybe we can imagine a configurable option like
`[database] optimize_db_for_read_access` or something similar that would
apply an extra batch of indexing and people can choose whether or not they
want it ?

Defining an index strategy for the db could be beneficial in the long run.
(both for the API, core and help contributors as well). If I write a new
piece of code accessing the db, how do I know if it is indexed ? How do I
know if I should add an index ? Should I index foreign keys, is that done
by default or should I explicitly do it ?

Just throwing out a few ideas, curious to know what others think.
Best




Le mer. 29 mai 2024 à 22:00, Daniel Standish
<[email protected]> a écrit :

> Historically we have added indexes as needed for the performance of airflow
> itself and not for the rest API.
>
> Lately we've observed more usage of task instances list endpoint and
> specifically filtering on end_date and / or start_date and / or
> execution_date.
>
> One line of argument goes that every possible filter in the rest API should
> have an index.  But not all users use the API at all, let alone use all
> params in the rest API.  Or even *all possible combinations?!*
>
> Another argument would be "this should be the responsibility of the cluster
> maintainer", since it's not part of the core operation of airflow
> (scheduler / tasks / webserver) and highly dependent on specific use case.
>
> Another thing to consider is sometimes there's an efficient way to get the
> TIs you're looking for already with a slight refactor.  E.g. get the dag
> runs first, then fetch the TIs using the PK.
>
> Of course adding indexes is a trade off between storage space,
> performance overhead maintaining them etc, balanced against reasonable
> usage of the REST API.
>
> I'm curious what folks think about this, and whether and when we should add
> indexes in OSS that help REST API queries.
>

Reply via email to