I would be in favor of this for sure. Let's see what others think :) On Thu, May 30, 2024 at 10:55 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> Simply speaking - let's make "lack of optimisation for these and that" part > of the API specification. > > On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <ja...@potiuk.com> wrote: > > > So let's document as part of the API which queries are not performant and > > suggest users that want to use them to make their analytics queries > > elsewhere. I'd very much prefer that it's slow "by design" for everyone > > rather than add option for the user to speed them up where we decided > > not to do it ourselves because we know the consequences. > > > > I think the root cause of the problem is `US` adding new filtering > options > > to public API without thinking about and documenting consequences. > > > > J. > > > > > > On Fri, May 31, 2024 at 7:38 AM Daniel Standish > > <daniel.stand...@astronomer.io.invalid> wrote: > > > >> So I think the notion that *all possibly expensive queries* should have > an > >> index to support them is not a tenable one. E.g. there are something > like > >> 5 params on TI list endpoint that don't have an index. > >> > >> In contrast with queries from airflow itself, the API queries are more > >> arbitrary -- user can combine in any way. So it's not so simple to > >> determine just from the endpoint params what indexes will be needed. > And > >> we're not going to add an index for all possible param combinations, > that > >> would be silly. So I think like it or not I think we will necessarily > >> draw > >> the line at *some* point and say we're trying to cover the most likely / > >> normal / reasonable uses and if your usage pattern is an outlier you > might > >> need to add an index yourself. And I don't think that's unreasonable > >> > >> Maybe at least one conclusion from this is to add an index for each of > >> start_date, end_date, and updated_at on TI. Already the storage > occupied > >> by indexes on TI is more than the table data size -- I don't remember it > >> might have been like 2x or something. So then index count will increase > >> from 8 to 11 and overhead increases. Whether this is a material > concern, > >> I > >> am honestly not sure -- I am not a db specialist. But it does feel icky > >> to > >> add all the indexes when they are used by, I suspect, a very small > >> percentage of airflow users overall. > >> > >> Part of this for me is also the reason these types of queries are run. > >> Typically it is for analytics purposes, either on the fly or as part of > a > >> replication process to a DWH or something. But there are more efficient > >> mechanisms for this than pulling the data through the webserver, e.g. > >> read-only replicas, or standard ETL processes, or just making better use > >> of > >> the indexes that are already there. Do we consider "replication" a > >> reasonable use of the REST API? > >> > >> So I may very well be alone on this. But (1) the API is not something > >> everyone uses. And (2) even of users of the API, probably only a very > >> small percentage are using it in a way that needs these indexes -- > >> otherwise why haven't we seen requests for indexes on these columns on > >> github? So, to me, it doesn't seem crazy to just leave it to the user / > >> platform to add. But as usual, not a hill to die on. > >> > > >