nevcohen opened a new issue, #45636:
URL: https://github.com/apache/airflow/issues/45636

   ### Apache Airflow version
   
   2.10.4
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.9.3
   
   ### What happened?
   
   We have a problem with our Airflow cluster. When the scheduler queries and filters the tasks eligible for scheduling, tasks in certain pools get stuck in the `scheduled` state even though those pools are not starved.
   
   This happens in the following situation:
   
   There are two pools, each with 20 slots. The first pool has 1000 tasks ready for scheduling, each with priority 2; the second pool has 100 tasks ready, each with priority 1. In the first pool, all slots are currently occupied except one; in the second pool, all slots are free.
   
   The scheduler sorts the 1100 tasks waiting in `scheduled` and pulls 32 at a time (we set `max_tis_per_query=32`). After filtering, all 32 tasks in each batch belong to the first pool (it is never considered starved, because some of its tasks are short, so it always has a few free slots). As a result, the tasks of the second pool are stuck in `scheduled` and cannot move forward, while the first pool receives 32 tasks per batch but can actually run only a few of them (2-3 tasks).
   
   The same applies to DAGs with many tasks but very low concurrency.
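   The effect can be illustrated with a small simulation (a hedged sketch, not the scheduler's actual code: the pool sizes, priorities, and `MAX_TIS` value are the hypothetical numbers from the scenario above, and sort-then-slice stands in for the scheduler's ordered query plus `limit(max_tis)`):

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    pool: str
    priority_weight: int


# Hypothetical numbers from the scenario: pool PA has 1 free slot,
# pool PB has all 20 slots free.
free_slots = {"PA": 1, "PB": 20}
MAX_TIS = 32

# 1000 priority-2 tasks on PA, 100 priority-1 tasks on PB,
# all waiting in the "scheduled" state.
scheduled = (
    [TaskInstance("PA", 2) for _ in range(1000)]
    + [TaskInstance("PB", 1) for _ in range(100)]
)

# The scheduler orders by priority_weight (descending) and takes the
# first MAX_TIS rows -- roughly what query.limit(max_tis) does.
batch = sorted(scheduled, key=lambda ti: -ti.priority_weight)[:MAX_TIS]

pools_in_batch = {ti.pool for ti in batch}
# Every task in the batch belongs to PA, even though PA can only use
# 1 of its 32 tasks, while PB's 20 free slots stay idle.
print(pools_in_batch)  # {'PA'}
```

   Since PA is never fully starved, every scheduling loop repeats this selection and PB's tasks never reach the batch.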
   
   ### What you think should happen instead?
   
   Don't just 
[filter](https://github.com/apache/airflow/blob/main/airflow%2Fjobs%2Fscheduler_job_runner.py#L360)
 out the pools/dags/tasks that are **starving**; instead, check how many **free slots** each one has and cap the number of tasks taken per pool/dag/task accordingly, 
[before](https://github.com/apache/airflow/blob/main/airflow%2Fjobs%2Fscheduler_job_runner.py#L374)
 applying `query.limit(max_tis)`.
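   A minimal sketch of that idea (hypothetical selection logic, not a patch against the scheduler; `select_batch` and its arguments are illustrative names):

```python
from collections import defaultdict, namedtuple

TI = namedtuple("TI", "pool priority_weight")


def select_batch(scheduled, free_slots, max_tis):
    """Take tasks in priority order, but never accept more tasks per
    pool than that pool has free slots (hypothetical logic)."""
    taken = defaultdict(int)
    batch = []
    for ti in sorted(scheduled, key=lambda t: -t.priority_weight):
        if taken[ti.pool] < free_slots.get(ti.pool, 0):
            taken[ti.pool] += 1
            batch.append(ti)
        if len(batch) >= max_tis:
            break
    return batch


scheduled = [TI("PA", 2)] * 1000 + [TI("PB", 1)] * 100
batch = select_batch(scheduled, {"PA": 1, "PB": 20}, 32)
# With the same inputs as before, PA contributes only the 1 task it can
# actually run, and PB's 20 free slots all get work.
print({p: sum(ti.pool == p for ti in batch) for p in ("PA", "PB")})
# {'PA': 1, 'PB': 20}
```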
   
   ### How to reproduce
   
   Pool PA: 3 slots
   
   Pool PB: 20 slots
   
   Dag DA: 1000 short tasks, each with priority 2 on pool PA.
   
   Dag DB: 100 short tasks, each with priority 1 on pool PB.
   
   Max active tasks = 32
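   A hedged reproduction sketch of the two DAGs above (the DAG IDs, task IDs, and `sleep 1` stand-in for "short tasks" are assumptions; the pools PA and PB must exist beforehand, e.g. created via `airflow pools set`):

```python
# Hypothetical reproduction DAGs for the scenario described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("DA", start_date=datetime(2025, 1, 1), schedule=None) as dag_a:
    for i in range(1000):
        BashOperator(
            task_id=f"short_task_{i}",
            bash_command="sleep 1",  # stands in for a short task
            pool="PA",               # 3 slots
            priority_weight=2,
        )

with DAG("DB", start_date=datetime(2025, 1, 1), schedule=None) as dag_b:
    for i in range(100):
        BashOperator(
            task_id=f"short_task_{i}",
            bash_command="sleep 1",
            pool="PB",               # 20 slots
            priority_weight=1,
        )
```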
   
   
   ### Operating System
   
   Debian GNU/Linux 12 (bookworm)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes==8.4.2
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Python 3.8
   
   executor: "KubernetesExecutor"
   
   scheduler:
   
   - max_tis_per_query: 32
   - schedule_after_task_execution: 'False'
   - task_queued_timeout_check_interval: 600
   
   kubernetes_executor:
   
   - worker_pods_creation_batch_size: 16
   
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

