Re: [DISCUSS] indexes for API calls

Daniel Standish Thu, 30 May 2024 23:07:00 -0700

I would be in favor of this for sure.  Let's see what others think :)

On Thu, May 30, 2024 at 10:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:


> Simply speaking - let's make "lack of optimisation for these and that" part
> of the API specification.
>
> On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > So let's document as part of the API which queries are not performant and
> > suggest users that want to use them to make their analytics queries
> > elsewhere. I'd very much prefer that it's slow "by design" for everyone
> > rather than add option for the user to speed them up where we decided
> > not to do it ourselves because we know the consequences.
> >
> > I think the root cause of the problem is `US` adding new filtering
> options
> > to public API without thinking about and documenting consequences.
> >
> > J.
> >
> >
> > On Fri, May 31, 2024 at 7:38 AM Daniel Standish
> > <daniel.stand...@astronomer.io.invalid> wrote:
> >
> >> So I think the notion that *all possibly expensive queries* should have
> an
> >> index to support them is not a tenable one.  E.g. there are something
> like
> >> 5 params on TI list endpoint that don't have an index.
> >>
> >> In contrast with queries from airflow itself, the API queries are more
> >> arbitrary -- user can combine in any way.  So it's not so simple to
> >> determine just from the endpoint params what indexes will be needed.
> And
> >> we're not going to add an index for all possible param combinations,
> that
> >> would be silly.  So I think like it or not I think we will necessarily
> >> draw
> >> the line at *some* point and say we're trying to cover the most likely /
> >> normal / reasonable uses and if your usage pattern is an outlier you
> might
> >> need to add an index yourself.  And I don't think that's unreasonable
> >>
> >> Maybe at least one conclusion from this is to add an index for each of
> >> start_date, end_date, and updated_at on TI.  Already the storage
> occupied
> >> by indexes on TI is more than the table data size -- I don't remember it
> >> might have been like 2x or something.  So then index count will increase
> >> from 8 to 11 and overhead increases.  Whether this is a material
> concern,
> >> I
> >> am honestly not sure -- I am not a db specialist.  But it does feel icky
> >> to
> >> add all the indexes when they are used by, I suspect, a very small
> >> percentage of airflow users overall.
> >>
> >> Part of this for me is also the reason these types of queries are run.
> >> Typically it is for analytics purposes, either on the fly or as part of
> a
> >> replication process to a DWH or something.  But there are more efficient
> >> mechanisms for this than pulling the data through the webserver, e.g.
> >> read-only replicas, or standard ETL processes, or just making better use
> >> of
> >> the indexes that are already there.  Do we consider "replication" a
> >> reasonable use of the REST API?
> >>
> >> So I may very well be alone on this. But (1) the API is not something
> >> everyone uses.  And (2) even of users of the API, probably only a very
> >> small percentage are using it in a way that needs these indexes --
> >> otherwise why haven't we seen requests for indexes on these columns on
> >> github?  So, to me, it doesn't seem crazy to just leave it to the user /
> >> platform to add.  But as usual, not a hill to die on.
> >>
> >
>

Re: [DISCUSS] indexes for API calls

Reply via email to