Simply speaking - let's make "lack of optimisation for these and that" part of the API specification.
On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <[email protected]> wrote: > So let's document as part of the API which queries are not performant and > suggest users that want to use them to make their analytics queries > elsewhere. I'd very much prefer that it's slow "by design" for everyone > rather than add option for the user to speed them up where we decided > not to do it ourselves because we know the consequences. > > I think the root cause of the problem is `US` adding new filtering options > to public API without thinking about and documenting consequences. > > J. > > > On Fri, May 31, 2024 at 7:38 AM Daniel Standish > <[email protected]> wrote: > >> So I think the notion that *all possibly expensive queries* should have an >> index to support them is not a tenable one. E.g. there are something like >> 5 params on TI list endpoint that don't have an index. >> >> In contrast with queries from airflow itself, the API queries are more >> arbitrary -- user can combine in any way. So it's not so simple to >> determine just from the endpoint params what indexes will be needed. And >> we're not going to add an index for all possible param combinations, that >> would be silly. So I think like it or not I think we will necessarily >> draw >> the line at *some* point and say we're trying to cover the most likely / >> normal / reasonable uses and if your usage pattern is an outlier you might >> need to add an index yourself. And I don't think that's unreasonable >> >> Maybe at least one conclusion from this is to add an index for each of >> start_date, end_date, and updated_at on TI. Already the storage occupied >> by indexes on TI is more than the table data size -- I don't remember it >> might have been like 2x or something. So then index count will increase >> from 8 to 11 and overhead increases. Whether this is a material concern, >> I >> am honestly not sure -- I am not a db specialist. But it does feel icky >> to >> add all the indexes when they are used by, I suspect, a very small >> percentage of airflow users overall. >> >> Part of this for me is also the reason these types of queries are run. >> Typically it is for analytics purposes, either on the fly or as part of a >> replication process to a DWH or something. But there are more efficient >> mechanisms for this than pulling the data through the webserver, e.g. >> read-only replicas, or standard ETL processes, or just making better use >> of >> the indexes that are already there. Do we consider "replication" a >> reasonable use of the REST API? >> >> So I may very well be alone on this. But (1) the API is not something >> everyone uses. And (2) even of users of the API, probably only a very >> small percentage are using it in a way that needs these indexes -- >> otherwise why haven't we seen requests for indexes on these columns on >> github? So, to me, it doesn't seem crazy to just leave it to the user / >> platform to add. But as usual, not a hill to die on. >> >
