Lots of good discussion here. We should create separate threads for the questions of (1) whether to keep or drop mysql / mssql / sqlite / mongodb / just-pipe-to-/dev/null, (2) to UUID or not to UUID, and (3) database agnosticism, a.k.a. an interface.
But some responses...

> Using UUIDs was the proposal for how we can bypass the limitation of
> MySQL for Airflow 2 when we discussed whether to do a "simple" version
> of team-prefix in dag id, or whether we want to mess with adding
> yet-another-field-to-indexes-that-are-already-too-long-for-mysql, and
> it was based on the assumption we are going to do that for versioning
> already, so I thought that was the plan :). Main reason why I put the
> multi-team voting on hold while we decide on the scope of Airflow 3,
> to be honest.

Yeah, I think folks *thought* we would be forced to go UUID by adding try_number to the grain of a "try history" type of table, but that turned out not to be the case, so I don't think any of the dag versioning AIPs will result in a new UUID PK, at least not in 2.x. We realized we could achieve the user-facing aims of those AIPs in 2.10 without a big painful migration, and figured it best to avoid such a thing unless necessary in 2.x, since there will be migration pain in 3.0 anyway and we'll have more freedom then (now that it appears the community is coalescing around 3.0 happening).

> Also just to be sure, the slowness you observe for listing task
> instances only appears with a certain combination of filters, while
> some other combinations, yielding a comparable result set in terms of
> 'size', are much quicker? (Just trying to understand if the slowness
> comes from a missing index, or also possibly from missing loading
> options that would cause lazy loading at serialization time slowing
> everything down. I don't see a lot of eager loading to prevent those,
> but maybe the few that we have are enough.)

TI has no index on end_date. If the table is sufficiently large and you list TIs within an end_date range, it will be slow. (There's a rough sketch of the missing index at the end of this message.)

> If we would like to add individual indexes for one field, we bump into
> the problem that these are less effective when it comes to
> filtering/sorting over multiple different fields....

Yeah, so this is sort of why I'd prefer our position to be that our indexing strategy is to optimize specifically the core operations of Airflow -- mainly Airflow-generated queries, or API calls that are operational, such as triggering a dag run or mucking with datasets and the like. But for things like "just pulling data out of the db" using whatever filters you want to use... that feels more like a gray area that in a perfect world we would not have to care about. One compromise approach could be that we have one indexed column for data takeout -- updated_at. If every table had an updated_at column that was indexed, then a downstream process could efficiently track changes (also sketched at the end of this message), and maybe we say "that's as good as you're going to get" from us re: data takeout. Then of course we might encounter unforced errors when updated_at doesn't work right, but that's another story....

> Rather than add these indexes (and every other "performance
> optimisation" measure), we should prove that this action gains
> benefits, rather than the opposite.

It's easy to prove that adding an index on end_date helps a filter on end_date for a sufficiently large table. It's 100%. But the problem is that it's not 100% of users that it helps, because most of them don't do this. That's sort of the main issue of this discussion thread: do we add the index when it has a 100% chance of helping (made-up number...) 5% of people, but all cost and no benefit for the rest?
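
To make the end_date point concrete, here's roughly what the missing index would look like as an Alembic migration. Purely illustrative -- the index name is made up and I'm not proposing we merge this as-is, just showing what we'd be trading off:

    # Hypothetical Alembic migration -- illustrative only, index name made up.
    # With this index, "list TIs in an end_date range" becomes an index range
    # scan instead of a full scan of a large task_instance table.
    from alembic import op


    def upgrade():
        op.create_index("idx_ti_end_date", "task_instance", ["end_date"])


    def downgrade():
        op.drop_index("idx_ti_end_date", table_name="task_instance")

The "cost for everyone else" side of the trade-off is the extra index maintenance on every task instance write, which is paid by all users whether or not they ever filter on end_date.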
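
And here's the updated_at compromise spelled out, under a big assumption: that every table carries an indexed updated_at column that the ORM bumps on write (which is not true across the board today). Table and column names below are hypothetical; this is a sketch of the pattern, not an API we have:

    # Sketch of watermark-based incremental takeout, assuming an indexed
    # updated_at column exists on the table being extracted (hypothetical).
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")  # example DSN


    def pull_changes(table, last_watermark):
        """Return rows changed since last_watermark and the new watermark."""
        with engine.connect() as conn:
            rows = conn.execute(
                text(
                    f"SELECT * FROM {table} "  # table name is internal/trusted here
                    "WHERE updated_at > :wm "
                    "ORDER BY updated_at"
                ),
                {"wm": last_watermark},
            ).fetchall()
        # With an index on updated_at this stays a range scan no matter how
        # large the table gets.
        new_watermark = rows[-1].updated_at if rows else last_watermark
        return rows, new_watermark

This is also exactly where the "unforced errors" I mentioned bite: rows committed late with an updated_at older than the watermark, or updated_at simply not being set on some code path, silently fall out of the extract. That's the part we'd have to be honest about if we pitch it as the supported takeout mechanism.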