Lots of good discussion here. We should create separate threads for the questions of (1) whether to keep or drop mysql / mssql / sqlite / mongodb / just-pipe-to-/dev/null, (2) to UUID or not to UUID, and (3) database agnosticism, a.k.a. an interface.
But some responses...

> Using UUIDs was the proposal for how we can bypass the limitation of
> MySQL for Airflow 2 when we discussed whether to do a "simple" version
> of team-prefix in dag id, or whether we want to mess with adding
> yet-another-field-to-indexes-that-are-already-too-long-for-mysql, and
> it was based on the assumption we are going to do that for versioning
> already, so I thought that was the plan :). Main reason why I put the
> multi-team voting on hold while we decide on the scope of Airflow 3,
> to be honest.

Yeah, I think folks *thought* we would be forced to go UUID by adding try_number to the grain of a "try history" type of table, but that turned out not to be the case, so I don't think any of the dag versioning AIPs will result in a new UUID PK, at least not in 2.x. We realized we could achieve the user-facing aims of those AIPs in 2.10 without a big painful migration, and figured it best to avoid such a thing unless necessary in 2.x, since there will be migration pain in 3.0 anyway and we'll have more freedom then (now that it appears the community is coalescing around 3.0 happening).

> Also just to be sure, the slowness you observe for listing task
> instances only appears with a certain combination of filters, while
> some other combinations, yielding a comparable result set in terms of
> 'size', are much quicker? (Just trying to understand if the slowness
> comes from a missing index, or also possibly from missing loading
> options that would cause lazy loading at serialization time slowing
> everything down. I don't see a lot of eager loading to prevent those,
> but maybe the few that we have are enough.)

TI has no index on end_date. If the table is sufficiently large and you list TIs within an end_date range, it will be slow. (There's a rough sketch of the missing index at the end of this message.)

> If we would like to add individual indexes for one field, we bump into
> the problem that these are less effective when it comes to
> filtering/sorting over multiple different fields....

Yeah, so this is sort of why I'd prefer our position to be that our indexing strategy is to optimize specifically the core operations of Airflow -- mainly Airflow-generated queries, or API calls that are operational, such as triggering a dag run or mucking with datasets and the like. But for things like "just pulling data out of the db" using whatever filters you want to use... that feels more like a gray area that in a perfect world we would not have to care about. One compromise approach could be that we have one indexed column for data takeout -- updated_at. If every table had an updated_at column that was indexed, then a downstream process could efficiently track changes (also sketched at the end of this message), and maybe we say "that's as good as you're going to get" from us re: data takeout. Then of course we might encounter unforced errors when updated_at doesn't work right, but that's another story....

> Rather than add these indexes (and every other "performance
> optimisation" measure), we should prove that this action gains
> benefits, rather than the opposite.

It's easy to prove that adding an index on end_date helps a filter on end_date for a sufficiently large table. It's 100%. But the problem is that it's not 100% of users that it helps, because most of them don't do this. That's sort of the main issue of this discussion thread: do we add the index when it has a 100% chance of helping (made-up number...) 5% of people, but all cost and no benefit for the rest?
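
To make the end_date point concrete, here's roughly what the missing index would look like as an Alembic migration. Purely illustrative -- the index name is made up and I'm not proposing we merge this as-is, just showing what we'd be trading off:

    # Hypothetical Alembic migration -- illustrative only, index name made up.
    # With this index, "list TIs in an end_date range" becomes an index range
    # scan instead of a full scan of a large task_instance table.
    from alembic import op


    def upgrade():
        op.create_index("idx_ti_end_date", "task_instance", ["end_date"])


    def downgrade():
        op.drop_index("idx_ti_end_date", table_name="task_instance")

The "cost for everyone else" side of the trade-off is the extra index maintenance on every task instance write, which is paid by all users whether or not they ever filter on end_date.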
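
And here's the updated_at compromise spelled out, under a big assumption: that every table carries an indexed updated_at column that the ORM bumps on write (which is not true across the board today). Table and column names below are hypothetical; this is a sketch of the pattern, not an API we have:

    # Sketch of watermark-based incremental takeout, assuming an indexed
    # updated_at column exists on the table being extracted (hypothetical).
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")  # example DSN


    def pull_changes(table, last_watermark):
        """Return rows changed since last_watermark and the new watermark."""
        with engine.connect() as conn:
            rows = conn.execute(
                text(
                    f"SELECT * FROM {table} "  # table name is internal/trusted here
                    "WHERE updated_at > :wm "
                    "ORDER BY updated_at"
                ),
                {"wm": last_watermark},
            ).fetchall()
        # With an index on updated_at this stays a range scan no matter how
        # large the table gets.
        new_watermark = rows[-1].updated_at if rows else last_watermark
        return rows, new_watermark

This is also exactly where the "unforced errors" I mentioned bite: rows committed late with an updated_at older than the watermark, or updated_at simply not being set on some code path, silently fall out of the extract. That's the part we'd have to be honest about if we pitch it as the supported takeout mechanism.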