nevcohen opened a new issue, #45636: URL: https://github.com/apache/airflow/issues/45636
### Apache Airflow version

2.10.4

### If "Other Airflow 2 version" selected, which one?

2.9.3

### What happened?

We have a problem with our Airflow cluster. When the scheduler extracts and filters tasks for scheduling, tasks in certain pools get stuck in the `scheduled` state even though those pools are not starved. This happens in the following situation: there are two pools, each with 20 slots. The first pool has 1000 tasks ready for scheduling, each with priority 2; the second pool has 100 tasks ready, each with priority 1. In the first pool, all slots are currently occupied except one; in the second, all slots are free. The scheduler sorts the 1100 tasks waiting in `scheduled` and pulls 32 at a time (we set `max_tis_per_query=32`), but because of the higher priority, all of the first 32 tasks the scheduler actually pulls after filtering belong to the first pool (it is never flagged as starved, because some of its tasks are short and it therefore always has a few single slots free). As a result, the tasks of the second pool are stuck in `scheduled` without being able to move forward, while from the first pool the scheduler pulls 32 tasks each cycle but can in practice run only a few of them (2-3 tasks). The same applies to DAGs with many tasks but very low concurrency.

### What you think should happen instead?

Don't [filter](https://github.com/apache/airflow/blob/main/airflow%2Fjobs%2Fscheduler_job_runner.py#L360) tasks only by which pools/DAGs/tasks are **starved**; instead, check how many **free slots** each one has and cap the number of tasks taken from each (starved pool/DAG/task) accordingly, [before](https://github.com/apache/airflow/blob/main/airflow%2Fjobs%2Fscheduler_job_runner.py#L374) `query.limit(max_tis)` is applied.

### How to reproduce

Pool PA: 3 slots
Pool PB: 20 slots
Dag DA: 1000 short tasks, each with priority 2, on pool PA.
Dag DB: 100 short tasks, each with priority 1, on pool PB.
Max active tasks = 32

### Operating System

Debian GNU/Linux 12 (bookworm)

### Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==8.4.2

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

Python 3.8
executor: "KubernetesExecutor"
scheduler:
- max_tis_per_query: 32
- schedule_after_task_execution: 'False'
- task_queued_timeout_check_interval: 600
kubernetes_executor:
- worker_pods_creation_batch_size: 16

### Anything else?

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
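To make the selection behavior concrete, here is a minimal, self-contained simulation of the scenario from "What happened?" (two 20-slot pools, priorities 2 and 1, `max_tis_per_query=32`). This is **not** Airflow code — `naive_selection` and `per_pool_capped_selection` are hypothetical helpers that only sketch the current global top-N slice versus the proposed per-pool free-slot cap:

```python
# Hypothetical sketch, not the actual scheduler implementation.
from collections import Counter

def naive_selection(tasks, max_tis):
    """Sort all scheduled tasks by priority and take the global top max_tis,
    ignoring how many free slots each pool actually has."""
    ordered = sorted(tasks, key=lambda t: -t["priority"])
    return ordered[:max_tis]

def per_pool_capped_selection(tasks, free_slots, max_tis):
    """Proposed alternative: cap each pool's contribution at its number of
    free slots *before* applying the global max_tis limit."""
    taken = Counter()
    picked = []
    for t in sorted(tasks, key=lambda t: -t["priority"]):
        pool = t["pool"]
        if taken[pool] < free_slots.get(pool, 0):
            picked.append(t)
            taken[pool] += 1
        if len(picked) == max_tis:
            break
    return picked

# Numbers mirror the report: 1000 priority-2 tasks on PA, 100 priority-1 on PB.
tasks = (
    [{"pool": "PA", "priority": 2} for _ in range(1000)]
    + [{"pool": "PB", "priority": 1} for _ in range(100)]
)
free_slots = {"PA": 1, "PB": 20}  # PA has one free slot, PB is fully free

naive = Counter(t["pool"] for t in naive_selection(tasks, 32))
capped = Counter(t["pool"] for t in per_pool_capped_selection(tasks, free_slots, 32))

print(naive)   # all 32 candidates come from PA; PB is starved
print(capped)  # PA contributes 1 task (its free slot), PB contributes 20
```

With the naive slice, every one of the 32 candidates belongs to PA even though it can only run one of them, so PB's 20 free slots sit idle; the per-pool cap lets PB's tasks through in the same scheduling cycle.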
