Yes. I agree. The Airflow REST API does not have to be super efficient, with
the exception of designated calls that are really designed to be efficient.
We might want to speed up some things for Airflow UI purposes, for example.
And it should always be an explicit decision when we add new filtering etc.

But since we know users will have some needs there - we should be explicit
about what they can and cannot do.

I think it's just worth documenting as a design decision. And either
explicitly mark those 'designed to be efficient' or those that may not be
efficient. (The latter is probably better - we likely have just a few of
those filters).

Then we should tell users who need some of that efficiency that they might
make other decisions - for example, extracting the data separately and
incrementally to another data source and querying it from there with their
own offline tables and indexes. And we should tell them: 'do not modify the
Airflow DB to achieve efficiency - this is not the way to go and will break
things'.
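
To make the "extract incrementally to another data source" idea concrete,
here is a minimal sketch of watermark-based copying. Everything in it is an
assumption for illustration: the table names (task_rows, task_rows_copy), the
updated_at column, and the use of in-memory SQLite for both sides. It is not
the Airflow schema and not a recommended implementation, just the general
shape of reading changed rows and keeping your own indexed copy.

```python
import sqlite3


def make_demo_source():
    """Stand-in for the system you read from (hypothetical table and columns)."""
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE task_rows (id INTEGER PRIMARY KEY, state TEXT, updated_at TEXT)"
    )
    source.executemany(
        "INSERT INTO task_rows VALUES (?, ?, ?)",
        [
            (1, "success", "2024-06-17T10:00:00"),
            (2, "running", "2024-06-17T11:00:00"),
        ],
    )
    source.commit()
    return source


def make_replica():
    """User-owned copy; indexes live here, never in the source database."""
    replica = sqlite3.connect(":memory:")
    replica.execute(
        "CREATE TABLE task_rows_copy (id INTEGER PRIMARY KEY, state TEXT, updated_at TEXT)"
    )
    replica.execute("CREATE INDEX idx_copy_state ON task_rows_copy (state)")
    return replica


def sync_incremental(source, replica, watermark):
    """Copy rows changed after `watermark`; return the new watermark.

    Assumes updated_at is an ISO-8601 string, so text order matches time order.
    A strict '>' can miss rows that share the watermark timestamp; a real
    pipeline would need to handle that case.
    """
    rows = source.execute(
        "SELECT id, state, updated_at FROM task_rows "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE overwrites the existing copy of a row with the same id.
    replica.executemany(
        "INSERT OR REPLACE INTO task_rows_copy (id, state, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    replica.commit()
    return rows[-1][2] if rows else watermark
```

The caller stores the returned watermark and passes it to the next run, so
each run only reads rows changed since the previous one. All the querying
then happens against the copy.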

This way nobody can complain about it - when we explicitly state a design
decision and tell the users what they can do on their own to get what they
want (without them modifying the internal DB) - this seems like a complete
solution.

Explicit is better than implicit.


J.

On Tue, 18 Jun 2024, 00:58 Daniel Standish
<daniel.stand...@astronomer.io.invalid> wrote:

> I tinkered with this a bit.  I think the case to add indexes for "REST" API
> may be weaker than even *I* thought (and I thought it was weak / was mostly
> against it).
>
> The reason is cus there are reasons beyond indexing why such API calls are
> slow, and adding indexes may not even help that much.  For example the TI
> list endpoint lists TIs. TIs have a *lot* of information distributed across
> many tables.  We always load all of it.  Sometimes multiple queries to get
> it.  That's just not going to be an efficient way to implement change data
> capture a.k.a. replication.
>
> I think the main responsibility of airflow is to be performant with regard
> to task execution broadly defined.  Anything beyond that is sort of not
> relevant.  And getting data out of the metastore for ancillary purposes is
> outside that line IMO.  The range of possible queries etc are too varied to
> reasonably optimize for.  So it's best left to the person wearing the "dba"
> hat at the organization.
>
> Users can also add endpoints via plugins that query the data more
> efficiently.
>
> That's my take in a nutshell I guess.
>
