Greetings!
During migrating our ORM syntax to compatible with SQLAlchemy 2.0 I
probably found skeletons in the closet.
Let's start from the beginning, initially I got this warning
airflow/models/renderedtifields.py:245 RemovedIn20Warning('ORDER BY columns
added implicitly due to DISTINCT is deprecated and will be removed in
SQLAlchemy 2.0. SELECT statements with DISTINCT should be written to
explicitly include the appropriate columns in the columns clause
(Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)')
"OK let's fix it!", I thought at first and started to investigate
RenderedTaskInstanceFields model
*Skeleton #1:*
When I first time look on the code and comments it got me to thinking that
part which keep only latest N Rendered Task Fields potentially could lead
different performance degradation (Locks, Dead Locks, Data Bloating): see
code
https://github.com/apache/airflow/blob/6c479437b1aedf74d029463bda56b42950278287/airflow/models/renderedtifields.py#L185-L245
Also this historical part (from Airflow 1.10.10) generate this SQL
Statement (pg backend)
DELETE FROM rendered_task_instance_fields
WHERE rendered_task_instance_fields.dag_id = %(dag_id_1) s
AND rendered_task_instance_fields.task_id = %(task_id_1) s
AND (
(
rendered_task_instance_fields.dag_id,
rendered_task_instance_fields.task_id,
rendered_task_instance_fields.run_id
) NOT IN (
SELECT
anon_1.dag_id,
anon_1.task_id,
anon_1.run_id
FROM
(
SELECT DISTINCT
rendered_task_instance_fields.dag_id AS dag_id,
rendered_task_instance_fields.task_id AS task_id,
rendered_task_instance_fields.run_id AS run_id,
dag_run.execution_date AS execution_date
FROM rendered_task_instance_fields
JOIN dag_run ON rendered_task_instance_fields.dag_id =
dag_run.dag_id
AND rendered_task_instance_fields.run_id = dag_run.run_id
WHERE
rendered_task_instance_fields.dag_id = %(dag_id_2) s
AND rendered_task_instance_fields.task_id = %(task_id_2) s
ORDER BY
dag_run.execution_date DESC
limit %(param_1) s
) AS anon_1
)
)
Which is especially not effective in PostgreSQL. When IN SUBQUERY could be
easily transform internaly into SEMI-JOIN (aka EXISTS clause), but it is
not working for NOT IN SUBQUERY because it is not transformed into ANTI
JOIN (aka NOT EXISTS clause) even if it possible, see:
https://commitfest.postgresql.org/27/2023/
I didn't do any performance benchmarks yet but I guess if users set
AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK to 0 rather than default
30 it could improve performance and reduce number of DeadLocks, however the
table size will increase but I think we don't do any maintenance job for
other tables.
Potentially it is a good idea to deprecate this option and recommend for
users to set it to 0? WDYT? Maybe someone has already tried or investigated
this?
*Skeleton #2:*
We have a k8s_pod_yaml field which is exclusively used by K8S executors.
Should we also decouple this field as part of AIP-51?
----
Best Wishes
*Andrey Anshin*