Re: [DISCUSS] indexes for API calls

Jarek Potiuk Fri, 31 May 2024 00:13:25 -0700

Using UUIDS was the proposal how we can bypass the limitation of MYSQL for
Airflow 2 when we discussed whether to do a "simple" version of team-prefix
in dag id, or whether we want to mess with adding
yet-another-field-to-indexes-that-are-already-too-long-for-mysql and it was
based on the assumption we are going to that for versioning already, so I
thought that is the plan :). Main reason why I put the multi-team voting on
hold while we decide on the scope of Airflow 3 to be honest.


https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1714140873323569?thread_ts=1713607491.495499&cid=C06K9Q5G2UA

J.


On Fri, May 31, 2024 at 9:04 AM Daniel Standish
<daniel.stand...@astronomer.io.invalid> wrote:

> Yes uuid is risky and problematic as primary key. If you do it you need to
> do carefully/ sequential.  But I think that we are not going with UUID pk
> on any tables at this time.
>
> BUT I do want to add a uuid for every TI try that is not PK but can be used
> as a more convenient identifier when tying together log events related to
> TIs emitted on different services.
>
> On Thu, May 30, 2024 at 11:53 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > Also if we are speaking about indexes - a bit tangential but I know we
> were
> > planning to replace some of the primary keys (mainly because of mysql
> > limitations) with synthetic keys for DAG versioning casse where we
> planned
> > to use UUIDs).
> >
> > We should be very, very careful when doing it because I've learned
> recently
> > that due to UUID randomness the algorithm to rebalance binary tree do not
> > work well with random UUIDs which means that when you do a lot of
> inserts,
> > rebalancing of the trees is happening very often and performance of
> inserts
> > is heavily affected.  Postgres has a dedicated UUID field type for that
> > that produces an optimal index
> >
> >
> https://dba.stackexchange.com/questions/102448/how-should-i-index-a-uuid-in-postgres
> > - however it might or might not be affected. Similarly MariaDB, but MySQL
> > does not seem to have proper UUID support, so we should really use UUID7
> > rather that UUID4 for such UUIDs in case we do not want to affect insert
> > performance on MySQL.
> >
> > WAT???? Yeah. That is. - I think - rather non-obvious side-effect.
> >
> > This learning I had adds much more strength to two related discussions:
> >
> > 1) This only adds more fire to "let's only stick to Postgres". We can
> > optimize the experience for our users way better and move faster.
> >
> > 2) Let's REALLY not put the power of deciding what indexes to put on the
> > database of ours in the hands of our users. There are many unobvious
> > consequences of adding indexes and while we can aim to learn, foresee and
> > control them (mostly when we have a working performance test suite),
> > getting a random user to put a random index on a random set of keys in
> > Airflow DB might have a number of unobvious consequences which might give
> > our users (and us) a LOT of headache.
> >
> > J.
> >
> >
> >
> >
> > On Fri, May 31, 2024 at 8:13 AM Pankaj Koti
> > <pankaj.k...@astronomer.io.invalid> wrote:
> >
> > > Addressing one of Pierre's questions:
> > >
> > > Should I index foreign keys? Is that done by default or should I
> > explicitly
> > > do it?
> > >
> > > The answer varies depending on the database engine.
> > > PostgreSQL and SQLite do not add indexes on foreign keys
> > > by default, while MySQL does. Developers should keep this
> > > in mind. Many of us (myself included) may have assumed
> > > from our university days, indexes are always created for
> > > foreign keys, but that's not always the case. :)
> > >
> > > Best regards,
> > >
> > > *Pankaj Koti*
> > > Senior Software Engineer (Airflow OSS Engineering team)
> > > Location: Pune, Maharashtra, India
> > > Timezone: Indian Standard Time (IST)
> > >
> > >
> > > On Fri, May 31, 2024 at 11:37 AM Daniel Standish
> > > <daniel.stand...@astronomer.io.invalid> wrote:
> > >
> > > > I would be in favor of this for sure.  Let's see what others think :)
> > > >
> > > > On Thu, May 30, 2024 at 10:55 PM Jarek Potiuk <ja...@potiuk.com>
> > wrote:
> > > >
> > > > > Simply speaking - let's make "lack of optimisation for these and
> > that"
> > > > part
> > > > > of the API specification.
> > > > >
> > > > > On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <ja...@potiuk.com>
> > wrote:
> > > > >
> > > > > > So let's document as part of the API which queries are not
> > performant
> > > > and
> > > > > > suggest users that want to use them to make their analytics
> queries
> > > > > > elsewhere. I'd very much prefer that it's slow "by design" for
> > > everyone
> > > > > > rather than add option for the user to speed them up where we
> > decided
> > > > > > not to do it ourselves because we know the consequences.
> > > > > >
> > > > > > I think the root cause of the problem is `US` adding new
> filtering
> > > > > options
> > > > > > to public API without thinking about and documenting
> consequences.
> > > > > >
> > > > > > J.
> > > > > >
> > > > > >
> > > > > > On Fri, May 31, 2024 at 7:38 AM Daniel Standish
> > > > > > <daniel.stand...@astronomer.io.invalid> wrote:
> > > > > >
> > > > > >> So I think the notion that *all possibly expensive queries*
> should
> > > > have
> > > > > an
> > > > > >> index to support them is not a tenable one.  E.g. there are
> > > something
> > > > > like
> > > > > >> 5 params on TI list endpoint that don't have an index.
> > > > > >>
> > > > > >> In contrast with queries from airflow itself, the API queries
> are
> > > more
> > > > > >> arbitrary -- user can combine in any way.  So it's not so simple
> > to
> > > > > >> determine just from the endpoint params what indexes will be
> > needed.
> > > > > And
> > > > > >> we're not going to add an index for all possible param
> > combinations,
> > > > > that
> > > > > >> would be silly.  So I think like it or not I think we will
> > > necessarily
> > > > > >> draw
> > > > > >> the line at *some* point and say we're trying to cover the most
> > > > likely /
> > > > > >> normal / reasonable uses and if your usage pattern is an outlier
> > you
> > > > > might
> > > > > >> need to add an index yourself.  And I don't think that's
> > > unreasonable
> > > > > >>
> > > > > >> Maybe at least one conclusion from this is to add an index for
> > each
> > > of
> > > > > >> start_date, end_date, and updated_at on TI.  Already the storage
> > > > > occupied
> > > > > >> by indexes on TI is more than the table data size -- I don't
> > > remember
> > > > it
> > > > > >> might have been like 2x or something.  So then index count will
> > > > increase
> > > > > >> from 8 to 11 and overhead increases.  Whether this is a material
> > > > > concern,
> > > > > >> I
> > > > > >> am honestly not sure -- I am not a db specialist.  But it does
> > feel
> > > > icky
> > > > > >> to
> > > > > >> add all the indexes when they are used by, I suspect, a very
> small
> > > > > >> percentage of airflow users overall.
> > > > > >>
> > > > > >> Part of this for me is also the reason these types of queries
> are
> > > run.
> > > > > >> Typically it is for analytics purposes, either on the fly or as
> > part
> > > > of
> > > > > a
> > > > > >> replication process to a DWH or something.  But there are more
> > > > efficient
> > > > > >> mechanisms for this than pulling the data through the webserver,
> > > e.g.
> > > > > >> read-only replicas, or standard ETL processes, or just making
> > better
> > > > use
> > > > > >> of
> > > > > >> the indexes that are already there.  Do we consider
> "replication"
> > a
> > > > > >> reasonable use of the REST API?
> > > > > >>
> > > > > >> So I may very well be alone on this. But (1) the API is not
> > > something
> > > > > >> everyone uses.  And (2) even of users of the API, probably only
> a
> > > very
> > > > > >> small percentage are using it in a way that needs these indexes
> --
> > > > > >> otherwise why haven't we seen requests for indexes on these
> > columns
> > > on
> > > > > >> github?  So, to me, it doesn't seem crazy to just leave it to
> the
> > > > user /
> > > > > >> platform to add.  But as usual, not a hill to die on.
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] indexes for API calls

Reply via email to