dstandish commented on PR #53492: URL: https://github.com/apache/airflow/pull/53492#issuecomment-3161258751
You seem to pretty quickly dismiss the notion that the query is .... pretty gnarly. You don't think it can be simplified? Or maybe you just want to get some data on performance first. It's also, you know, I'm not sure that piling so much into one query is really required to solve starvation problems. But it is not a positive for intelligibility, and can make it harder to optimize. The nesting in particular I think is pretty tough for intelligibility. Maybe it's necessary but I'd be surprised. I did notice you are missing a join condition here: ``` JOIN dag_run ON task_instance.run_id = dag_run.run_id ``` This would result in join explosion if you have same run id for diff dags (and since airlfow's run id is derived from a date, it happens commonly). You may be able to avoid some joins to dag run if you are just after TI. Some of our relationships are eager loading and result in joins we don't need (but it can be disabled with query options). After taking a little closer look at the query.... I was struck by how often we go back and do a row number over the task instance table. Is that really necessary? I would have expected that maybe once we grab the pool of candidate TIs in the innermost select, that we'd be able to just successively filter that down to the final set, even if we do have to involve some nested window functions. Over and over it's select *, row_number() from TI, DR, inner_query. Is there no way to stop going back to TI and DR after the first result set? It also can be a nice thing to do (for readability and debugging) to provide an alias or label for your subselects and ctes (e.g. to avoid anon_6 etc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
