Using UUIDS was the proposal how we can bypass the limitation of MYSQL for Airflow 2 when we discussed whether to do a "simple" version of team-prefix in dag id, or whether we want to mess with adding yet-another-field-to-indexes-that-are-already-too-long-for-mysql and it was based on the assumption we are going to that for versioning already, so I thought that is the plan :). Main reason why I put the multi-team voting on hold while we decide on the scope of Airflow 3 to be honest.
https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1714140873323569?thread_ts=1713607491.495499&cid=C06K9Q5G2UA J. On Fri, May 31, 2024 at 9:04 AM Daniel Standish <daniel.stand...@astronomer.io.invalid> wrote: > Yes uuid is risky and problematic as primary key. If you do it you need to > do carefully/ sequential. But I think that we are not going with UUID pk > on any tables at this time. > > BUT I do want to add a uuid for every TI try that is not PK but can be used > as a more convenient identifier when tying together log events related to > TIs emitted on different services. > > On Thu, May 30, 2024 at 11:53 PM Jarek Potiuk <ja...@potiuk.com> wrote: > > > Also if we are speaking about indexes - a bit tangential but I know we > were > > planning to replace some of the primary keys (mainly because of mysql > > limitations) with synthetic keys for DAG versioning casse where we > planned > > to use UUIDs). > > > > We should be very, very careful when doing it because I've learned > recently > > that due to UUID randomness the algorithm to rebalance binary tree do not > > work well with random UUIDs which means that when you do a lot of > inserts, > > rebalancing of the trees is happening very often and performance of > inserts > > is heavily affected. Postgres has a dedicated UUID field type for that > > that produces an optimal index > > > > > https://dba.stackexchange.com/questions/102448/how-should-i-index-a-uuid-in-postgres > > - however it might or might not be affected. Similarly MariaDB, but MySQL > > does not seem to have proper UUID support, so we should really use UUID7 > > rather that UUID4 for such UUIDs in case we do not want to affect insert > > performance on MySQL. > > > > WAT???? Yeah. That is. - I think - rather non-obvious side-effect. > > > > This learning I had adds much more strength to two related discussions: > > > > 1) This only adds more fire to "let's only stick to Postgres". We can > > optimize the experience for our users way better and move faster. > > > > 2) Let's REALLY not put the power of deciding what indexes to put on the > > database of ours in the hands of our users. There are many unobvious > > consequences of adding indexes and while we can aim to learn, foresee and > > control them (mostly when we have a working performance test suite), > > getting a random user to put a random index on a random set of keys in > > Airflow DB might have a number of unobvious consequences which might give > > our users (and us) a LOT of headache. > > > > J. > > > > > > > > > > On Fri, May 31, 2024 at 8:13 AM Pankaj Koti > > <pankaj.k...@astronomer.io.invalid> wrote: > > > > > Addressing one of Pierre's questions: > > > > > > Should I index foreign keys? Is that done by default or should I > > explicitly > > > do it? > > > > > > The answer varies depending on the database engine. > > > PostgreSQL and SQLite do not add indexes on foreign keys > > > by default, while MySQL does. Developers should keep this > > > in mind. Many of us (myself included) may have assumed > > > from our university days, indexes are always created for > > > foreign keys, but that's not always the case. :) > > > > > > Best regards, > > > > > > *Pankaj Koti* > > > Senior Software Engineer (Airflow OSS Engineering team) > > > Location: Pune, Maharashtra, India > > > Timezone: Indian Standard Time (IST) > > > > > > > > > On Fri, May 31, 2024 at 11:37 AM Daniel Standish > > > <daniel.stand...@astronomer.io.invalid> wrote: > > > > > > > I would be in favor of this for sure. Let's see what others think :) > > > > > > > > On Thu, May 30, 2024 at 10:55 PM Jarek Potiuk <ja...@potiuk.com> > > wrote: > > > > > > > > > Simply speaking - let's make "lack of optimisation for these and > > that" > > > > part > > > > > of the API specification. > > > > > > > > > > On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <ja...@potiuk.com> > > wrote: > > > > > > > > > > > So let's document as part of the API which queries are not > > performant > > > > and > > > > > > suggest users that want to use them to make their analytics > queries > > > > > > elsewhere. I'd very much prefer that it's slow "by design" for > > > everyone > > > > > > rather than add option for the user to speed them up where we > > decided > > > > > > not to do it ourselves because we know the consequences. > > > > > > > > > > > > I think the root cause of the problem is `US` adding new > filtering > > > > > options > > > > > > to public API without thinking about and documenting > consequences. > > > > > > > > > > > > J. > > > > > > > > > > > > > > > > > > On Fri, May 31, 2024 at 7:38 AM Daniel Standish > > > > > > <daniel.stand...@astronomer.io.invalid> wrote: > > > > > > > > > > > >> So I think the notion that *all possibly expensive queries* > should > > > > have > > > > > an > > > > > >> index to support them is not a tenable one. E.g. there are > > > something > > > > > like > > > > > >> 5 params on TI list endpoint that don't have an index. > > > > > >> > > > > > >> In contrast with queries from airflow itself, the API queries > are > > > more > > > > > >> arbitrary -- user can combine in any way. So it's not so simple > > to > > > > > >> determine just from the endpoint params what indexes will be > > needed. > > > > > And > > > > > >> we're not going to add an index for all possible param > > combinations, > > > > > that > > > > > >> would be silly. So I think like it or not I think we will > > > necessarily > > > > > >> draw > > > > > >> the line at *some* point and say we're trying to cover the most > > > > likely / > > > > > >> normal / reasonable uses and if your usage pattern is an outlier > > you > > > > > might > > > > > >> need to add an index yourself. And I don't think that's > > > unreasonable > > > > > >> > > > > > >> Maybe at least one conclusion from this is to add an index for > > each > > > of > > > > > >> start_date, end_date, and updated_at on TI. Already the storage > > > > > occupied > > > > > >> by indexes on TI is more than the table data size -- I don't > > > remember > > > > it > > > > > >> might have been like 2x or something. So then index count will > > > > increase > > > > > >> from 8 to 11 and overhead increases. Whether this is a material > > > > > concern, > > > > > >> I > > > > > >> am honestly not sure -- I am not a db specialist. But it does > > feel > > > > icky > > > > > >> to > > > > > >> add all the indexes when they are used by, I suspect, a very > small > > > > > >> percentage of airflow users overall. > > > > > >> > > > > > >> Part of this for me is also the reason these types of queries > are > > > run. > > > > > >> Typically it is for analytics purposes, either on the fly or as > > part > > > > of > > > > > a > > > > > >> replication process to a DWH or something. But there are more > > > > efficient > > > > > >> mechanisms for this than pulling the data through the webserver, > > > e.g. > > > > > >> read-only replicas, or standard ETL processes, or just making > > better > > > > use > > > > > >> of > > > > > >> the indexes that are already there. Do we consider > "replication" > > a > > > > > >> reasonable use of the REST API? > > > > > >> > > > > > >> So I may very well be alone on this. But (1) the API is not > > > something > > > > > >> everyone uses. And (2) even of users of the API, probably only > a > > > very > > > > > >> small percentage are using it in a way that needs these indexes > -- > > > > > >> otherwise why haven't we seen requests for indexes on these > > columns > > > on > > > > > >> github? So, to me, it doesn't seem crazy to just leave it to > the > > > > user / > > > > > >> platform to add. But as usual, not a hill to die on. > > > > > >> > > > > > > > > > > > > > > > > > > > > >