Yes uuid is risky and problematic as primary key. If you do it you need to do carefully/ sequential. But I think that we are not going with UUID pk on any tables at this time.
BUT I do want to add a uuid for every TI try that is not PK but can be used as a more convenient identifier when tying together log events related to TIs emitted on different services. On Thu, May 30, 2024 at 11:53 PM Jarek Potiuk <ja...@potiuk.com> wrote: > Also if we are speaking about indexes - a bit tangential but I know we were > planning to replace some of the primary keys (mainly because of mysql > limitations) with synthetic keys for DAG versioning casse where we planned > to use UUIDs). > > We should be very, very careful when doing it because I've learned recently > that due to UUID randomness the algorithm to rebalance binary tree do not > work well with random UUIDs which means that when you do a lot of inserts, > rebalancing of the trees is happening very often and performance of inserts > is heavily affected. Postgres has a dedicated UUID field type for that > that produces an optimal index > > https://dba.stackexchange.com/questions/102448/how-should-i-index-a-uuid-in-postgres > - however it might or might not be affected. Similarly MariaDB, but MySQL > does not seem to have proper UUID support, so we should really use UUID7 > rather that UUID4 for such UUIDs in case we do not want to affect insert > performance on MySQL. > > WAT???? Yeah. That is. - I think - rather non-obvious side-effect. > > This learning I had adds much more strength to two related discussions: > > 1) This only adds more fire to "let's only stick to Postgres". We can > optimize the experience for our users way better and move faster. > > 2) Let's REALLY not put the power of deciding what indexes to put on the > database of ours in the hands of our users. There are many unobvious > consequences of adding indexes and while we can aim to learn, foresee and > control them (mostly when we have a working performance test suite), > getting a random user to put a random index on a random set of keys in > Airflow DB might have a number of unobvious consequences which might give > our users (and us) a LOT of headache. > > J. > > > > > On Fri, May 31, 2024 at 8:13 AM Pankaj Koti > <pankaj.k...@astronomer.io.invalid> wrote: > > > Addressing one of Pierre's questions: > > > > Should I index foreign keys? Is that done by default or should I > explicitly > > do it? > > > > The answer varies depending on the database engine. > > PostgreSQL and SQLite do not add indexes on foreign keys > > by default, while MySQL does. Developers should keep this > > in mind. Many of us (myself included) may have assumed > > from our university days, indexes are always created for > > foreign keys, but that's not always the case. :) > > > > Best regards, > > > > *Pankaj Koti* > > Senior Software Engineer (Airflow OSS Engineering team) > > Location: Pune, Maharashtra, India > > Timezone: Indian Standard Time (IST) > > > > > > On Fri, May 31, 2024 at 11:37 AM Daniel Standish > > <daniel.stand...@astronomer.io.invalid> wrote: > > > > > I would be in favor of this for sure. Let's see what others think :) > > > > > > On Thu, May 30, 2024 at 10:55 PM Jarek Potiuk <ja...@potiuk.com> > wrote: > > > > > > > Simply speaking - let's make "lack of optimisation for these and > that" > > > part > > > > of the API specification. > > > > > > > > On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <ja...@potiuk.com> > wrote: > > > > > > > > > So let's document as part of the API which queries are not > performant > > > and > > > > > suggest users that want to use them to make their analytics queries > > > > > elsewhere. I'd very much prefer that it's slow "by design" for > > everyone > > > > > rather than add option for the user to speed them up where we > decided > > > > > not to do it ourselves because we know the consequences. > > > > > > > > > > I think the root cause of the problem is `US` adding new filtering > > > > options > > > > > to public API without thinking about and documenting consequences. > > > > > > > > > > J. > > > > > > > > > > > > > > > On Fri, May 31, 2024 at 7:38 AM Daniel Standish > > > > > <daniel.stand...@astronomer.io.invalid> wrote: > > > > > > > > > >> So I think the notion that *all possibly expensive queries* should > > > have > > > > an > > > > >> index to support them is not a tenable one. E.g. there are > > something > > > > like > > > > >> 5 params on TI list endpoint that don't have an index. > > > > >> > > > > >> In contrast with queries from airflow itself, the API queries are > > more > > > > >> arbitrary -- user can combine in any way. So it's not so simple > to > > > > >> determine just from the endpoint params what indexes will be > needed. > > > > And > > > > >> we're not going to add an index for all possible param > combinations, > > > > that > > > > >> would be silly. So I think like it or not I think we will > > necessarily > > > > >> draw > > > > >> the line at *some* point and say we're trying to cover the most > > > likely / > > > > >> normal / reasonable uses and if your usage pattern is an outlier > you > > > > might > > > > >> need to add an index yourself. And I don't think that's > > unreasonable > > > > >> > > > > >> Maybe at least one conclusion from this is to add an index for > each > > of > > > > >> start_date, end_date, and updated_at on TI. Already the storage > > > > occupied > > > > >> by indexes on TI is more than the table data size -- I don't > > remember > > > it > > > > >> might have been like 2x or something. So then index count will > > > increase > > > > >> from 8 to 11 and overhead increases. Whether this is a material > > > > concern, > > > > >> I > > > > >> am honestly not sure -- I am not a db specialist. But it does > feel > > > icky > > > > >> to > > > > >> add all the indexes when they are used by, I suspect, a very small > > > > >> percentage of airflow users overall. > > > > >> > > > > >> Part of this for me is also the reason these types of queries are > > run. > > > > >> Typically it is for analytics purposes, either on the fly or as > part > > > of > > > > a > > > > >> replication process to a DWH or something. But there are more > > > efficient > > > > >> mechanisms for this than pulling the data through the webserver, > > e.g. > > > > >> read-only replicas, or standard ETL processes, or just making > better > > > use > > > > >> of > > > > >> the indexes that are already there. Do we consider "replication" > a > > > > >> reasonable use of the REST API? > > > > >> > > > > >> So I may very well be alone on this. But (1) the API is not > > something > > > > >> everyone uses. And (2) even of users of the API, probably only a > > very > > > > >> small percentage are using it in a way that needs these indexes -- > > > > >> otherwise why haven't we seen requests for indexes on these > columns > > on > > > > >> github? So, to me, it doesn't seem crazy to just leave it to the > > > user / > > > > >> platform to add. But as usual, not a hill to die on. > > > > >> > > > > > > > > > > > > > > >