Re: [DISCUSS] indexes for API calls

Pierre Jeambrun Fri, 31 May 2024 01:10:54 -0700

Indeed Jarek I feel like this is another point in favor of stick to
"Postgres"


As mentioned, maybe we were a little reckless when adding all these kinds
of filters. If they are not often used and we rarely / never see
performance github issues on those, marking them as 'non optimised but here
for convenience' could be a good short term solution. (And maybe removing
those filters at some point)

What people use the API for (replicating the DB, building an UI, etc...) do
not really matter I think, Ideally I just think that the API should have a
sort of SLA and guarantee a reasonable response time for all endpoints
(through pagination, indexes, etc...). We can still document that some
endpoints are known to be slower for some reason and leave them out of this
performance constraint. (Having performance tests could be great otherwise
it will be very hard to track).

Also just to be sure, the slowness you observe for listing task instances
only appears with a certain combination of filters, while some other
combinations, yielding a comparable result set in terms of 'size' are much
quicker ? (Just trying to understand if the slowness comes from missing
index, or also possibly from missing loading options that would cause lazy
loading at serialization time slowing everything down, I don't see a lot of
eager loading to prevent those, but maybe the few that we have are enough).

Le ven. 31 mai 2024 à 09:38, Jarek Potiuk <[email protected]> a écrit :

> And to be perfectly honest -  if people (like me) hesitate on settling on
> architectural decisions because they are afraid that their changes might
> have unintended consequences, because we want to support all the different
> kinds of databases - this is one more reason we should stick to "Postgres
> only".  This is a very good example where we have real slow-downs that
> impact the speed the decisions are made (which impacts speed of feature
> development).
>
> On Fri, May 31, 2024 at 9:13 AM Jarek Potiuk <[email protected]> wrote:
>
> > Using UUIDS was the proposal how we can bypass the limitation of MYSQL
> for
> > Airflow 2 when we discussed whether to do a "simple" version of
> team-prefix
> > in dag id, or whether we want to mess with adding
> > yet-another-field-to-indexes-that-are-already-too-long-for-mysql and it
> was
> > based on the assumption we are going to that for versioning already, so I
> > thought that is the plan :). Main reason why I put the multi-team voting
> on
> > hold while we decide on the scope of Airflow 3 to be honest.
> >
> >
> >
> https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1714140873323569?thread_ts=1713607491.495499&cid=C06K9Q5G2UA
> >
> > J.
> >
> >
> > On Fri, May 31, 2024 at 9:04 AM Daniel Standish
> > <[email protected]> wrote:
> >
> >> Yes uuid is risky and problematic as primary key. If you do it you need
> to
> >> do carefully/ sequential.  But I think that we are not going with UUID
> pk
> >> on any tables at this time.
> >>
> >> BUT I do want to add a uuid for every TI try that is not PK but can be
> >> used
> >> as a more convenient identifier when tying together log events related
> to
> >> TIs emitted on different services.
> >>
> >> On Thu, May 30, 2024 at 11:53 PM Jarek Potiuk <[email protected]> wrote:
> >>
> >> > Also if we are speaking about indexes - a bit tangential but I know we
> >> were
> >> > planning to replace some of the primary keys (mainly because of mysql
> >> > limitations) with synthetic keys for DAG versioning casse where we
> >> planned
> >> > to use UUIDs).
> >> >
> >> > We should be very, very careful when doing it because I've learned
> >> recently
> >> > that due to UUID randomness the algorithm to rebalance binary tree do
> >> not
> >> > work well with random UUIDs which means that when you do a lot of
> >> inserts,
> >> > rebalancing of the trees is happening very often and performance of
> >> inserts
> >> > is heavily affected.  Postgres has a dedicated UUID field type for
> that
> >> > that produces an optimal index
> >> >
> >> >
> >>
> https://dba.stackexchange.com/questions/102448/how-should-i-index-a-uuid-in-postgres
> >> > - however it might or might not be affected. Similarly MariaDB, but
> >> MySQL
> >> > does not seem to have proper UUID support, so we should really use
> UUID7
> >> > rather that UUID4 for such UUIDs in case we do not want to affect
> insert
> >> > performance on MySQL.
> >> >
> >> > WAT???? Yeah. That is. - I think - rather non-obvious side-effect.
> >> >
> >> > This learning I had adds much more strength to two related
> discussions:
> >> >
> >> > 1) This only adds more fire to "let's only stick to Postgres". We can
> >> > optimize the experience for our users way better and move faster.
> >> >
> >> > 2) Let's REALLY not put the power of deciding what indexes to put on
> the
> >> > database of ours in the hands of our users. There are many unobvious
> >> > consequences of adding indexes and while we can aim to learn, foresee
> >> and
> >> > control them (mostly when we have a working performance test suite),
> >> > getting a random user to put a random index on a random set of keys in
> >> > Airflow DB might have a number of unobvious consequences which might
> >> give
> >> > our users (and us) a LOT of headache.
> >> >
> >> > J.
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, May 31, 2024 at 8:13 AM Pankaj Koti
> >> > <[email protected]> wrote:
> >> >
> >> > > Addressing one of Pierre's questions:
> >> > >
> >> > > Should I index foreign keys? Is that done by default or should I
> >> > explicitly
> >> > > do it?
> >> > >
> >> > > The answer varies depending on the database engine.
> >> > > PostgreSQL and SQLite do not add indexes on foreign keys
> >> > > by default, while MySQL does. Developers should keep this
> >> > > in mind. Many of us (myself included) may have assumed
> >> > > from our university days, indexes are always created for
> >> > > foreign keys, but that's not always the case. :)
> >> > >
> >> > > Best regards,
> >> > >
> >> > > *Pankaj Koti*
> >> > > Senior Software Engineer (Airflow OSS Engineering team)
> >> > > Location: Pune, Maharashtra, India
> >> > > Timezone: Indian Standard Time (IST)
> >> > >
> >> > >
> >> > > On Fri, May 31, 2024 at 11:37 AM Daniel Standish
> >> > > <[email protected]> wrote:
> >> > >
> >> > > > I would be in favor of this for sure.  Let's see what others think
> >> :)
> >> > > >
> >> > > > On Thu, May 30, 2024 at 10:55 PM Jarek Potiuk <[email protected]>
> >> > wrote:
> >> > > >
> >> > > > > Simply speaking - let's make "lack of optimisation for these and
> >> > that"
> >> > > > part
> >> > > > > of the API specification.
> >> > > > >
> >> > > > > On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <[email protected]>
> >> > wrote:
> >> > > > >
> >> > > > > > So let's document as part of the API which queries are not
> >> > performant
> >> > > > and
> >> > > > > > suggest users that want to use them to make their analytics
> >> queries
> >> > > > > > elsewhere. I'd very much prefer that it's slow "by design" for
> >> > > everyone
> >> > > > > > rather than add option for the user to speed them up where we
> >> > decided
> >> > > > > > not to do it ourselves because we know the consequences.
> >> > > > > >
> >> > > > > > I think the root cause of the problem is `US` adding new
> >> filtering
> >> > > > > options
> >> > > > > > to public API without thinking about and documenting
> >> consequences.
> >> > > > > >
> >> > > > > > J.
> >> > > > > >
> >> > > > > >
> >> > > > > > On Fri, May 31, 2024 at 7:38 AM Daniel Standish
> >> > > > > > <[email protected]> wrote:
> >> > > > > >
> >> > > > > >> So I think the notion that *all possibly expensive queries*
> >> should
> >> > > > have
> >> > > > > an
> >> > > > > >> index to support them is not a tenable one.  E.g. there are
> >> > > something
> >> > > > > like
> >> > > > > >> 5 params on TI list endpoint that don't have an index.
> >> > > > > >>
> >> > > > > >> In contrast with queries from airflow itself, the API queries
> >> are
> >> > > more
> >> > > > > >> arbitrary -- user can combine in any way.  So it's not so
> >> simple
> >> > to
> >> > > > > >> determine just from the endpoint params what indexes will be
> >> > needed.
> >> > > > > And
> >> > > > > >> we're not going to add an index for all possible param
> >> > combinations,
> >> > > > > that
> >> > > > > >> would be silly.  So I think like it or not I think we will
> >> > > necessarily
> >> > > > > >> draw
> >> > > > > >> the line at *some* point and say we're trying to cover the
> most
> >> > > > likely /
> >> > > > > >> normal / reasonable uses and if your usage pattern is an
> >> outlier
> >> > you
> >> > > > > might
> >> > > > > >> need to add an index yourself.  And I don't think that's
> >> > > unreasonable
> >> > > > > >>
> >> > > > > >> Maybe at least one conclusion from this is to add an index
> for
> >> > each
> >> > > of
> >> > > > > >> start_date, end_date, and updated_at on TI.  Already the
> >> storage
> >> > > > > occupied
> >> > > > > >> by indexes on TI is more than the table data size -- I don't
> >> > > remember
> >> > > > it
> >> > > > > >> might have been like 2x or something.  So then index count
> will
> >> > > > increase
> >> > > > > >> from 8 to 11 and overhead increases.  Whether this is a
> >> material
> >> > > > > concern,
> >> > > > > >> I
> >> > > > > >> am honestly not sure -- I am not a db specialist.  But it
> does
> >> > feel
> >> > > > icky
> >> > > > > >> to
> >> > > > > >> add all the indexes when they are used by, I suspect, a very
> >> small
> >> > > > > >> percentage of airflow users overall.
> >> > > > > >>
> >> > > > > >> Part of this for me is also the reason these types of queries
> >> are
> >> > > run.
> >> > > > > >> Typically it is for analytics purposes, either on the fly or
> as
> >> > part
> >> > > > of
> >> > > > > a
> >> > > > > >> replication process to a DWH or something.  But there are
> more
> >> > > > efficient
> >> > > > > >> mechanisms for this than pulling the data through the
> >> webserver,
> >> > > e.g.
> >> > > > > >> read-only replicas, or standard ETL processes, or just making
> >> > better
> >> > > > use
> >> > > > > >> of
> >> > > > > >> the indexes that are already there.  Do we consider
> >> "replication"
> >> > a
> >> > > > > >> reasonable use of the REST API?
> >> > > > > >>
> >> > > > > >> So I may very well be alone on this. But (1) the API is not
> >> > > something
> >> > > > > >> everyone uses.  And (2) even of users of the API, probably
> >> only a
> >> > > very
> >> > > > > >> small percentage are using it in a way that needs these
> >> indexes --
> >> > > > > >> otherwise why haven't we seen requests for indexes on these
> >> > columns
> >> > > on
> >> > > > > >> github?  So, to me, it doesn't seem crazy to just leave it to
> >> the
> >> > > > user /
> >> > > > > >> platform to add.  But as usual, not a hill to die on.
> >> > > > > >>
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] indexes for API calls

Reply via email to