Gabriel Silk created AIRFLOW-2430:
-------------------------------------

    Summary: Bad query patterns at scale prevent scheduler from starting
    Key: AIRFLOW-2430
    URL: https://issues.apache.org/jira/browse/AIRFLOW-2430
    Project: Apache Airflow
    Issue Type: Bug
    Components: scheduler
    Reporter: Gabriel Silk

h2. Summary

Certain queries executed by the scheduler do not scale well with the number of tasks being operated on. Two example functions:

* reset_state_for_orphaned_tasks
* _execute_task_instances

Concretely, with a mere 75k tasks being operated on, the first query can take dozens of minutes to run, blocking the scheduler from making progress.

The cause is twofold:

1. As the query grows past a certain point, the MySQL planner will choose to do a full table scan as opposed to using an index. I assume the same is true of Postgres.
2. The query predicate size grows linearly in the number of tasks being operated on, thus increasing the amount of work that needs to be done per row.

In a sense, you’re left with an operation that scales as O(n^2): a full scan over roughly n rows, each of which is checked against a predicate of roughly n terms.

h2. Proposed Fix

It appears that one of these bad query patterns was fixed in [3547cbffd|https://github.com/apache/incubator-airflow/commit/3547cbffdbffac2f98a8aa05526e8c9671221025] by introducing a configurable batch size, which can be set via max_tis_per_query.

I propose we extend the suggested fix to include other poorly-performing queries in the scheduler. I’ve identified two queries that are directly affecting my work and included them in the diff, though the same approach can be extended to more queries as we see fit.

Thanks!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
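
For illustration, the batching idea under Proposed Fix can be sketched roughly as below. This is a minimal sketch, not the code from the referenced commit or the attached diff: the helper names chunked and run_in_batches are hypothetical, run_query stands in for whatever function issues one bounded query, and treating a non-positive batch size as "no batching" is an assumption made here.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long.

    Hypothetical helper for illustration only.
    """
    if not items:
        return
    if batch_size <= 0:
        # Assumption for this sketch: a non-positive size disables batching.
        yield items
        return
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(
    keys: Sequence[T],
    batch_size: int,
    run_query: Callable[[Sequence[T]], int],
) -> int:
    """Call `run_query` once per batch and return the summed row counts.

    Each call sees at most `batch_size` keys, so the predicate of any single
    query stays bounded no matter how many keys there are in total.
    """
    total = 0
    for batch in chunked(keys, batch_size):
        total += run_query(batch)
    return total


if __name__ == "__main__":
    # Stand-in for a real database call: just report each batch's size.
    def fake_query(batch: Sequence[int]) -> int:
        print(f"query with {len(batch)} keys")
        return len(batch)

    run_in_batches(list(range(75_000)), 500, fake_query)
```

The point of the design is that each query's predicate has a fixed upper bound, which should make it more likely that the planner keeps using an index; the trade-off is more round trips to the database.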