Re: [DISCUSS] indexes for API calls

Jarek Potiuk Thu, 30 May 2024 22:56:00 -0700

Simply speaking - let's make "lack of optimisation for these and that" part
of the API specification.


On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <[email protected]> wrote:

> So let's document as part of the API which queries are not performant and
> suggest users that want to use them to make their analytics queries
> elsewhere. I'd very much prefer that it's slow "by design" for everyone
> rather than add option for the user to speed them up where we decided
> not to do it ourselves because we know the consequences.
>
> I think the root cause of the problem is `US` adding new filtering options
> to public API without thinking about and documenting consequences.
>
> J.
>
>
> On Fri, May 31, 2024 at 7:38 AM Daniel Standish
> <[email protected]> wrote:
>
>> So I think the notion that *all possibly expensive queries* should have an
>> index to support them is not a tenable one.  E.g. there are something like
>> 5 params on TI list endpoint that don't have an index.
>>
>> In contrast with queries from airflow itself, the API queries are more
>> arbitrary -- user can combine in any way.  So it's not so simple to
>> determine just from the endpoint params what indexes will be needed.  And
>> we're not going to add an index for all possible param combinations, that
>> would be silly.  So I think like it or not I think we will necessarily
>> draw
>> the line at *some* point and say we're trying to cover the most likely /
>> normal / reasonable uses and if your usage pattern is an outlier you might
>> need to add an index yourself.  And I don't think that's unreasonable
>>
>> Maybe at least one conclusion from this is to add an index for each of
>> start_date, end_date, and updated_at on TI.  Already the storage occupied
>> by indexes on TI is more than the table data size -- I don't remember it
>> might have been like 2x or something.  So then index count will increase
>> from 8 to 11 and overhead increases.  Whether this is a material concern,
>> I
>> am honestly not sure -- I am not a db specialist.  But it does feel icky
>> to
>> add all the indexes when they are used by, I suspect, a very small
>> percentage of airflow users overall.
>>
>> Part of this for me is also the reason these types of queries are run.
>> Typically it is for analytics purposes, either on the fly or as part of a
>> replication process to a DWH or something.  But there are more efficient
>> mechanisms for this than pulling the data through the webserver, e.g.
>> read-only replicas, or standard ETL processes, or just making better use
>> of
>> the indexes that are already there.  Do we consider "replication" a
>> reasonable use of the REST API?
>>
>> So I may very well be alone on this. But (1) the API is not something
>> everyone uses.  And (2) even of users of the API, probably only a very
>> small percentage are using it in a way that needs these indexes --
>> otherwise why haven't we seen requests for indexes on these columns on
>> github?  So, to me, it doesn't seem crazy to just leave it to the user /
>> platform to add.  But as usual, not a hill to die on.
>>
>

Re: [DISCUSS] indexes for API calls

Reply via email to