Re: [DISCUSS] indexes for API calls

Andrey Anshin Fri, 31 May 2024 11:38:13 -0700

> however it might or might not be affected. Similarly MariaDB, but MySQL does
not seem to have proper UUID support, so we should really use UUID7 rather
that UUID4 for such UUIDs in case we do not want to affect insert performance
on MySQL.


Some NITs here, I guess better to use deterministic UUID, such as v5 or v3
instead of random/timebased based like v1/v4/v7. So if you have all
components + hash you could always generate this UUID on client side,
instead of get it by JOIN or somewhere else

MySQL do not have native implementation, however SQLAlchemy 2 has a
dialect-neutral UUID, that mean if DB backend supports it natively, then it
will use it otherwise CHAR(32):
https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.Uuid
, I don't think that is a problem to have it also in SA 14


On Fri, 31 May 2024 at 19:28, Vincent Beck <vincb...@apache.org> wrote:

> Interesting thread.
>
> I think what makes this discussion complex is that Airflow makes a lot of
> different queries (API, Scheduler, ...). I think it is even harder to keep
> track of all the different queries Airflow makes and thus, hard to figure
> if such index is needed. Also, Airflow evolves (and rightfully so),
> therefore, one index might have been useful in the past but not anymore due
> to some modifications in the code.
>
> What would help us is a clear contract between the database and Airflow,
> meaning what are all the operations/queries Airflow needs. I know it had
> been mentioned during the last community sync and some folks are against
> but I wanted to reiterate this idea of having an API/interface for any DB
> operation. I dont know the consequences in terms of performance (especially
> for the scheduler) of even feasibility, but I think it might be worth to
> think about it? Especially because that would open some doors such as using
> NoSQL DB engines to run Airflow.
>
> (Sorry if I deviated the conversation a bit).
>
> On 2024/05/31 12:10:11 Andrey Anshin wrote:
> > IMHO, blindy adding new indexes into the `dag_run` and `task_instance`
> > tables will cause additional maintenance costs.
> >
> > There are 8 indexes already exists per each of this tables
> >
> > SELECT
> >     pi.schemaname schema_name,
> >     pi.tablename table_name,
> >     count(*) num
> > FROM pg_indexes pi
> > WHERE pi.schemaname = 'public'
> > GROUP BY 1, 2
> > ORDER BY 3 DESC, 2;
> >
> > And maintenance indexes overall it is not something cheap, every time
> when
> > the row changed/added (INSERT/UPDATE/DELETE/MERGE/COPY) there is a check
> > whether or not it is required to rebalance index tree (B-Tree indexes).
> >
> > If we would like to add individual indexes for one field we bump into the
> > problem that this are less effective when it comes to a filter/sorting
> over
> > multiple different fields (include PK), just because this action required
> > to create Bitmap Scan into the memory (Working Memory in postgres,
> default
> > something like 10 MiB), and in the end COB (Cost Base Optimisation)
> > mechanism more like decide to do sequential scan, as result indexes
> become
> > useless, and just use additional disc space and CPU for maintenance.
> >
> > Rather than add these indexes (and every other "performance optimisation"
> > measures) we should prove that this action gains benefits, rather than
> the
> > opposite. That might be achieved by enable on some prod-like environment
> > logs collection and analyze logs by pgBadger (
> > https://github.com/darold/pgbadger), that might shows different metrics
> > before changes and after the changes
> >
> > This all valid for Postgres, and I have no idea how to calculate measures
> > for MySQL, I hope we will reach an agreement to drop MySQL in Airflow 3.
> >
> > On Fri, 31 May 2024 at 12:10, Pierre Jeambrun <pierrejb...@gmail.com>
> wrote:
> >
> > > Indeed Jarek I feel like this is another point in favor of stick to
> > > "Postgres"
> > >
> > > As mentioned, maybe we were a little reckless when adding all these
> kinds
> > > of filters. If they are not often used and we rarely / never see
> > > performance github issues on those, marking them as 'non optimised but
> here
> > > for convenience' could be a good short term solution. (And maybe
> removing
> > > those filters at some point)
> > >
> > > What people use the API for (replicating the DB, building an UI,
> etc...) do
> > > not really matter I think, Ideally I just think that the API should
> have a
> > > sort of SLA and guarantee a reasonable response time for all endpoints
> > > (through pagination, indexes, etc...). We can still document that some
> > > endpoints are known to be slower for some reason and leave them out of
> this
> > > performance constraint. (Having performance tests could be great
> otherwise
> > > it will be very hard to track).
> > >
> > > Also just to be sure, the slowness you observe for listing task
> instances
> > > only appears with a certain combination of filters, while some other
> > > combinations, yielding a comparable result set in terms of 'size' are
> much
> > > quicker ? (Just trying to understand if the slowness comes from missing
> > > index, or also possibly from missing loading options that would cause
> lazy
> > > loading at serialization time slowing everything down, I don't see a
> lot of
> > > eager loading to prevent those, but maybe the few that we have are
> enough).
> > >
> > > Le ven. 31 mai 2024 à 09:38, Jarek Potiuk <ja...@potiuk.com> a écrit :
> > >
> > > > And to be perfectly honest -  if people (like me) hesitate on
> settling on
> > > > architectural decisions because they are afraid that their changes
> might
> > > > have unintended consequences, because we want to support all the
> > > different
> > > > kinds of databases - this is one more reason we should stick to
> "Postgres
> > > > only".  This is a very good example where we have real slow-downs
> that
> > > > impact the speed the decisions are made (which impacts speed of
> feature
> > > > development).
> > > >
> > > > On Fri, May 31, 2024 at 9:13 AM Jarek Potiuk <ja...@potiuk.com>
> wrote:
> > > >
> > > > > Using UUIDS was the proposal how we can bypass the limitation of
> MYSQL
> > > > for
> > > > > Airflow 2 when we discussed whether to do a "simple" version of
> > > > team-prefix
> > > > > in dag id, or whether we want to mess with adding
> > > > > yet-another-field-to-indexes-that-are-already-too-long-for-mysql
> and it
> > > > was
> > > > > based on the assumption we are going to that for versioning
> already,
> > > so I
> > > > > thought that is the plan :). Main reason why I put the multi-team
> > > voting
> > > > on
> > > > > hold while we decide on the scope of Airflow 3 to be honest.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1714140873323569?thread_ts=1713607491.495499&cid=C06K9Q5G2UA
> > > > >
> > > > > J.
> > > > >
> > > > >
> > > > > On Fri, May 31, 2024 at 9:04 AM Daniel Standish
> > > > > <daniel.stand...@astronomer.io.invalid> wrote:
> > > > >
> > > > >> Yes uuid is risky and problematic as primary key. If you do it you
> > > need
> > > > to
> > > > >> do carefully/ sequential.  But I think that we are not going with
> UUID
> > > > pk
> > > > >> on any tables at this time.
> > > > >>
> > > > >> BUT I do want to add a uuid for every TI try that is not PK but
> can be
> > > > >> used
> > > > >> as a more convenient identifier when tying together log events
> related
> > > > to
> > > > >> TIs emitted on different services.
> > > > >>
> > > > >> On Thu, May 30, 2024 at 11:53 PM Jarek Potiuk <ja...@potiuk.com>
> > > wrote:
> > > > >>
> > > > >> > Also if we are speaking about indexes - a bit tangential but I
> know
> > > we
> > > > >> were
> > > > >> > planning to replace some of the primary keys (mainly because of
> > > mysql
> > > > >> > limitations) with synthetic keys for DAG versioning casse where
> we
> > > > >> planned
> > > > >> > to use UUIDs).
> > > > >> >
> > > > >> > We should be very, very careful when doing it because I've
> learned
> > > > >> recently
> > > > >> > that due to UUID randomness the algorithm to rebalance binary
> tree
> > > do
> > > > >> not
> > > > >> > work well with random UUIDs which means that when you do a lot
> of
> > > > >> inserts,
> > > > >> > rebalancing of the trees is happening very often and
> performance of
> > > > >> inserts
> > > > >> > is heavily affected.  Postgres has a dedicated UUID field type
> for
> > > > that
> > > > >> > that produces an optimal index
> > > > >> >
> > > > >> >
> > > > >>
> > > >
> > >
> https://dba.stackexchange.com/questions/102448/how-should-i-index-a-uuid-in-postgres
> > > > >> > - however it might or might not be affected. Similarly MariaDB,
> but
> > > > >> MySQL
> > > > >> > does not seem to have proper UUID support, so we should really
> use
> > > > UUID7
> > > > >> > rather that UUID4 for such UUIDs in case we do not want to
> affect
> > > > insert
> > > > >> > performance on MySQL.
> > > > >> >
> > > > >> > WAT???? Yeah. That is. - I think - rather non-obvious
> side-effect.
> > > > >> >
> > > > >> > This learning I had adds much more strength to two related
> > > > discussions:
> > > > >> >
> > > > >> > 1) This only adds more fire to "let's only stick to Postgres".
> We
> > > can
> > > > >> > optimize the experience for our users way better and move
> faster.
> > > > >> >
> > > > >> > 2) Let's REALLY not put the power of deciding what indexes to
> put on
> > > > the
> > > > >> > database of ours in the hands of our users. There are many
> unobvious
> > > > >> > consequences of adding indexes and while we can aim to learn,
> > > foresee
> > > > >> and
> > > > >> > control them (mostly when we have a working performance test
> suite),
> > > > >> > getting a random user to put a random index on a random set of
> keys
> > > in
> > > > >> > Airflow DB might have a number of unobvious consequences which
> might
> > > > >> give
> > > > >> > our users (and us) a LOT of headache.
> > > > >> >
> > > > >> > J.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Fri, May 31, 2024 at 8:13 AM Pankaj Koti
> > > > >> > <pankaj.k...@astronomer.io.invalid> wrote:
> > > > >> >
> > > > >> > > Addressing one of Pierre's questions:
> > > > >> > >
> > > > >> > > Should I index foreign keys? Is that done by default or
> should I
> > > > >> > explicitly
> > > > >> > > do it?
> > > > >> > >
> > > > >> > > The answer varies depending on the database engine.
> > > > >> > > PostgreSQL and SQLite do not add indexes on foreign keys
> > > > >> > > by default, while MySQL does. Developers should keep this
> > > > >> > > in mind. Many of us (myself included) may have assumed
> > > > >> > > from our university days, indexes are always created for
> > > > >> > > foreign keys, but that's not always the case. :)
> > > > >> > >
> > > > >> > > Best regards,
> > > > >> > >
> > > > >> > > *Pankaj Koti*
> > > > >> > > Senior Software Engineer (Airflow OSS Engineering team)
> > > > >> > > Location: Pune, Maharashtra, India
> > > > >> > > Timezone: Indian Standard Time (IST)
> > > > >> > >
> > > > >> > >
> > > > >> > > On Fri, May 31, 2024 at 11:37 AM Daniel Standish
> > > > >> > > <daniel.stand...@astronomer.io.invalid> wrote:
> > > > >> > >
> > > > >> > > > I would be in favor of this for sure.  Let's see what others
> > > think
> > > > >> :)
> > > > >> > > >
> > > > >> > > > On Thu, May 30, 2024 at 10:55 PM Jarek Potiuk <
> ja...@potiuk.com
> > > >
> > > > >> > wrote:
> > > > >> > > >
> > > > >> > > > > Simply speaking - let's make "lack of optimisation for
> these
> > > and
> > > > >> > that"
> > > > >> > > > part
> > > > >> > > > > of the API specification.
> > > > >> > > > >
> > > > >> > > > > On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <
> > > ja...@potiuk.com>
> > > > >> > wrote:
> > > > >> > > > >
> > > > >> > > > > > So let's document as part of the API which queries are
> not
> > > > >> > performant
> > > > >> > > > and
> > > > >> > > > > > suggest users that want to use them to make their
> analytics
> > > > >> queries
> > > > >> > > > > > elsewhere. I'd very much prefer that it's slow "by
> design"
> > > for
> > > > >> > > everyone
> > > > >> > > > > > rather than add option for the user to speed them up
> where
> > > we
> > > > >> > decided
> > > > >> > > > > > not to do it ourselves because we know the consequences.
> > > > >> > > > > >
> > > > >> > > > > > I think the root cause of the problem is `US` adding new
> > > > >> filtering
> > > > >> > > > > options
> > > > >> > > > > > to public API without thinking about and documenting
> > > > >> consequences.
> > > > >> > > > > >
> > > > >> > > > > > J.
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > On Fri, May 31, 2024 at 7:38 AM Daniel Standish
> > > > >> > > > > > <daniel.stand...@astronomer.io.invalid> wrote:
> > > > >> > > > > >
> > > > >> > > > > >> So I think the notion that *all possibly expensive
> queries*
> > > > >> should
> > > > >> > > > have
> > > > >> > > > > an
> > > > >> > > > > >> index to support them is not a tenable one.  E.g.
> there are
> > > > >> > > something
> > > > >> > > > > like
> > > > >> > > > > >> 5 params on TI list endpoint that don't have an index.
> > > > >> > > > > >>
> > > > >> > > > > >> In contrast with queries from airflow itself, the API
> > > queries
> > > > >> are
> > > > >> > > more
> > > > >> > > > > >> arbitrary -- user can combine in any way.  So it's not
> so
> > > > >> simple
> > > > >> > to
> > > > >> > > > > >> determine just from the endpoint params what indexes
> will
> > > be
> > > > >> > needed.
> > > > >> > > > > And
> > > > >> > > > > >> we're not going to add an index for all possible param
> > > > >> > combinations,
> > > > >> > > > > that
> > > > >> > > > > >> would be silly.  So I think like it or not I think we
> will
> > > > >> > > necessarily
> > > > >> > > > > >> draw
> > > > >> > > > > >> the line at *some* point and say we're trying to cover
> the
> > > > most
> > > > >> > > > likely /
> > > > >> > > > > >> normal / reasonable uses and if your usage pattern is
> an
> > > > >> outlier
> > > > >> > you
> > > > >> > > > > might
> > > > >> > > > > >> need to add an index yourself.  And I don't think
> that's
> > > > >> > > unreasonable
> > > > >> > > > > >>
> > > > >> > > > > >> Maybe at least one conclusion from this is to add an
> index
> > > > for
> > > > >> > each
> > > > >> > > of
> > > > >> > > > > >> start_date, end_date, and updated_at on TI.  Already
> the
> > > > >> storage
> > > > >> > > > > occupied
> > > > >> > > > > >> by indexes on TI is more than the table data size -- I
> > > don't
> > > > >> > > remember
> > > > >> > > > it
> > > > >> > > > > >> might have been like 2x or something.  So then index
> count
> > > > will
> > > > >> > > > increase
> > > > >> > > > > >> from 8 to 11 and overhead increases.  Whether this is a
> > > > >> material
> > > > >> > > > > concern,
> > > > >> > > > > >> I
> > > > >> > > > > >> am honestly not sure -- I am not a db specialist.  But
> it
> > > > does
> > > > >> > feel
> > > > >> > > > icky
> > > > >> > > > > >> to
> > > > >> > > > > >> add all the indexes when they are used by, I suspect, a
> > > very
> > > > >> small
> > > > >> > > > > >> percentage of airflow users overall.
> > > > >> > > > > >>
> > > > >> > > > > >> Part of this for me is also the reason these types of
> > > queries
> > > > >> are
> > > > >> > > run.
> > > > >> > > > > >> Typically it is for analytics purposes, either on the
> fly
> > > or
> > > > as
> > > > >> > part
> > > > >> > > > of
> > > > >> > > > > a
> > > > >> > > > > >> replication process to a DWH or something.  But there
> are
> > > > more
> > > > >> > > > efficient
> > > > >> > > > > >> mechanisms for this than pulling the data through the
> > > > >> webserver,
> > > > >> > > e.g.
> > > > >> > > > > >> read-only replicas, or standard ETL processes, or just
> > > making
> > > > >> > better
> > > > >> > > > use
> > > > >> > > > > >> of
> > > > >> > > > > >> the indexes that are already there.  Do we consider
> > > > >> "replication"
> > > > >> > a
> > > > >> > > > > >> reasonable use of the REST API?
> > > > >> > > > > >>
> > > > >> > > > > >> So I may very well be alone on this. But (1) the API
> is not
> > > > >> > > something
> > > > >> > > > > >> everyone uses.  And (2) even of users of the API,
> probably
> > > > >> only a
> > > > >> > > very
> > > > >> > > > > >> small percentage are using it in a way that needs these
> > > > >> indexes --
> > > > >> > > > > >> otherwise why haven't we seen requests for indexes on
> these
> > > > >> > columns
> > > > >> > > on
> > > > >> > > > > >> github?  So, to me, it doesn't seem crazy to just
> leave it
> > > to
> > > > >> the
> > > > >> > > > user /
> > > > >> > > > > >> platform to add.  But as usual, not a hill to die on.
> > > > >> > > > > >>
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org
>
>

Re: [DISCUSS] indexes for API calls

Reply via email to