So let's document as part of the API which queries are not performant and
suggest users that want to use them to make their analytics queries
elsewhere. I'd very much prefer that it's slow "by design" for everyone
rather than add option for the user to speed them up where we decided
not to do it ourselves because we know the consequences.

I think the root cause of the problem is `US` adding new filtering options
to public API without thinking about and documenting consequences.

J.


On Fri, May 31, 2024 at 7:38 AM Daniel Standish
<daniel.stand...@astronomer.io.invalid> wrote:

> So I think the notion that *all possibly expensive queries* should have an
> index to support them is not a tenable one.  E.g. there are something like
> 5 params on TI list endpoint that don't have an index.
>
> In contrast with queries from airflow itself, the API queries are more
> arbitrary -- user can combine in any way.  So it's not so simple to
> determine just from the endpoint params what indexes will be needed.  And
> we're not going to add an index for all possible param combinations, that
> would be silly.  So I think like it or not I think we will necessarily draw
> the line at *some* point and say we're trying to cover the most likely /
> normal / reasonable uses and if your usage pattern is an outlier you might
> need to add an index yourself.  And I don't think that's unreasonable
>
> Maybe at least one conclusion from this is to add an index for each of
> start_date, end_date, and updated_at on TI.  Already the storage occupied
> by indexes on TI is more than the table data size -- I don't remember it
> might have been like 2x or something.  So then index count will increase
> from 8 to 11 and overhead increases.  Whether this is a material concern, I
> am honestly not sure -- I am not a db specialist.  But it does feel icky to
> add all the indexes when they are used by, I suspect, a very small
> percentage of airflow users overall.
>
> Part of this for me is also the reason these types of queries are run.
> Typically it is for analytics purposes, either on the fly or as part of a
> replication process to a DWH or something.  But there are more efficient
> mechanisms for this than pulling the data through the webserver, e.g.
> read-only replicas, or standard ETL processes, or just making better use of
> the indexes that are already there.  Do we consider "replication" a
> reasonable use of the REST API?
>
> So I may very well be alone on this. But (1) the API is not something
> everyone uses.  And (2) even of users of the API, probably only a very
> small percentage are using it in a way that needs these indexes --
> otherwise why haven't we seen requests for indexes on these columns on
> github?  So, to me, it doesn't seem crazy to just leave it to the user /
> platform to add.  But as usual, not a hill to die on.
>

Reply via email to