Yes uuid is risky and problematic as primary key. If you do it you need to
do carefully/ sequential.  But I think that we are not going with UUID pk
on any tables at this time.

BUT I do want to add a uuid for every TI try that is not PK but can be used
as a more convenient identifier when tying together log events related to
TIs emitted on different services.

On Thu, May 30, 2024 at 11:53 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Also if we are speaking about indexes - a bit tangential but I know we were
> planning to replace some of the primary keys (mainly because of mysql
> limitations) with synthetic keys for DAG versioning casse where we planned
> to use UUIDs).
>
> We should be very, very careful when doing it because I've learned recently
> that due to UUID randomness the algorithm to rebalance binary tree do not
> work well with random UUIDs which means that when you do a lot of inserts,
> rebalancing of the trees is happening very often and performance of inserts
> is heavily affected.  Postgres has a dedicated UUID field type for that
> that produces an optimal index
>
> https://dba.stackexchange.com/questions/102448/how-should-i-index-a-uuid-in-postgres
> - however it might or might not be affected. Similarly MariaDB, but MySQL
> does not seem to have proper UUID support, so we should really use UUID7
> rather that UUID4 for such UUIDs in case we do not want to affect insert
> performance on MySQL.
>
> WAT???? Yeah. That is. - I think - rather non-obvious side-effect.
>
> This learning I had adds much more strength to two related discussions:
>
> 1) This only adds more fire to "let's only stick to Postgres". We can
> optimize the experience for our users way better and move faster.
>
> 2) Let's REALLY not put the power of deciding what indexes to put on the
> database of ours in the hands of our users. There are many unobvious
> consequences of adding indexes and while we can aim to learn, foresee and
> control them (mostly when we have a working performance test suite),
> getting a random user to put a random index on a random set of keys in
> Airflow DB might have a number of unobvious consequences which might give
> our users (and us) a LOT of headache.
>
> J.
>
>
>
>
> On Fri, May 31, 2024 at 8:13 AM Pankaj Koti
> <pankaj.k...@astronomer.io.invalid> wrote:
>
> > Addressing one of Pierre's questions:
> >
> > Should I index foreign keys? Is that done by default or should I
> explicitly
> > do it?
> >
> > The answer varies depending on the database engine.
> > PostgreSQL and SQLite do not add indexes on foreign keys
> > by default, while MySQL does. Developers should keep this
> > in mind. Many of us (myself included) may have assumed
> > from our university days, indexes are always created for
> > foreign keys, but that's not always the case. :)
> >
> > Best regards,
> >
> > *Pankaj Koti*
> > Senior Software Engineer (Airflow OSS Engineering team)
> > Location: Pune, Maharashtra, India
> > Timezone: Indian Standard Time (IST)
> >
> >
> > On Fri, May 31, 2024 at 11:37 AM Daniel Standish
> > <daniel.stand...@astronomer.io.invalid> wrote:
> >
> > > I would be in favor of this for sure.  Let's see what others think :)
> > >
> > > On Thu, May 30, 2024 at 10:55 PM Jarek Potiuk <ja...@potiuk.com>
> wrote:
> > >
> > > > Simply speaking - let's make "lack of optimisation for these and
> that"
> > > part
> > > > of the API specification.
> > > >
> > > > On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <ja...@potiuk.com>
> wrote:
> > > >
> > > > > So let's document as part of the API which queries are not
> performant
> > > and
> > > > > suggest users that want to use them to make their analytics queries
> > > > > elsewhere. I'd very much prefer that it's slow "by design" for
> > everyone
> > > > > rather than add option for the user to speed them up where we
> decided
> > > > > not to do it ourselves because we know the consequences.
> > > > >
> > > > > I think the root cause of the problem is `US` adding new filtering
> > > > options
> > > > > to public API without thinking about and documenting consequences.
> > > > >
> > > > > J.
> > > > >
> > > > >
> > > > > On Fri, May 31, 2024 at 7:38 AM Daniel Standish
> > > > > <daniel.stand...@astronomer.io.invalid> wrote:
> > > > >
> > > > >> So I think the notion that *all possibly expensive queries* should
> > > have
> > > > an
> > > > >> index to support them is not a tenable one.  E.g. there are
> > something
> > > > like
> > > > >> 5 params on TI list endpoint that don't have an index.
> > > > >>
> > > > >> In contrast with queries from airflow itself, the API queries are
> > more
> > > > >> arbitrary -- user can combine in any way.  So it's not so simple
> to
> > > > >> determine just from the endpoint params what indexes will be
> needed.
> > > > And
> > > > >> we're not going to add an index for all possible param
> combinations,
> > > > that
> > > > >> would be silly.  So I think like it or not I think we will
> > necessarily
> > > > >> draw
> > > > >> the line at *some* point and say we're trying to cover the most
> > > likely /
> > > > >> normal / reasonable uses and if your usage pattern is an outlier
> you
> > > > might
> > > > >> need to add an index yourself.  And I don't think that's
> > unreasonable
> > > > >>
> > > > >> Maybe at least one conclusion from this is to add an index for
> each
> > of
> > > > >> start_date, end_date, and updated_at on TI.  Already the storage
> > > > occupied
> > > > >> by indexes on TI is more than the table data size -- I don't
> > remember
> > > it
> > > > >> might have been like 2x or something.  So then index count will
> > > increase
> > > > >> from 8 to 11 and overhead increases.  Whether this is a material
> > > > concern,
> > > > >> I
> > > > >> am honestly not sure -- I am not a db specialist.  But it does
> feel
> > > icky
> > > > >> to
> > > > >> add all the indexes when they are used by, I suspect, a very small
> > > > >> percentage of airflow users overall.
> > > > >>
> > > > >> Part of this for me is also the reason these types of queries are
> > run.
> > > > >> Typically it is for analytics purposes, either on the fly or as
> part
> > > of
> > > > a
> > > > >> replication process to a DWH or something.  But there are more
> > > efficient
> > > > >> mechanisms for this than pulling the data through the webserver,
> > e.g.
> > > > >> read-only replicas, or standard ETL processes, or just making
> better
> > > use
> > > > >> of
> > > > >> the indexes that are already there.  Do we consider "replication"
> a
> > > > >> reasonable use of the REST API?
> > > > >>
> > > > >> So I may very well be alone on this. But (1) the API is not
> > something
> > > > >> everyone uses.  And (2) even of users of the API, probably only a
> > very
> > > > >> small percentage are using it in a way that needs these indexes --
> > > > >> otherwise why haven't we seen requests for indexes on these
> columns
> > on
> > > > >> github?  So, to me, it doesn't seem crazy to just leave it to the
> > > user /
> > > > >> platform to add.  But as usual, not a hill to die on.
> > > > >>
> > > > >
> > > >
> > >
> >
>

Reply via email to