[ https://issues.apache.org/jira/browse/AIRFLOW-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Imberman closed AIRFLOW-72.
----------------------------------
    Resolution: Auto Closed

> Implement proper capacity scheduler
> -----------------------------------
>
>                 Key: AIRFLOW-72
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-72
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>            Reporter: Bolke de Bruin
>            Priority: Major
>              Labels: pool, queue, scheduler
>             Fix For: 2.0.0
>
>
> The scheduler is supposed to maintain queues and pools according to a
> "capacity" model. However, it is currently not properly implemented, and
> therefore issues exist, such as being able to oversubscribe pools, race
> conditions in queuing/dequeuing, and probably others.
> This Jira Epic is to track all issues related to pooling/queuing and the
> (tbd) roadmap to a proper capacity scheduler.
>
> Why queuing/scheduling is broken:
> Locking is not properly implemented, and cannot be, because the check for
> slot availability is spread throughout the scheduler, taskinstance and
> executor. This makes obtaining a slot non-atomic and results in
> oversubscribing. In addition, it leads to race conditions, such as two tasks
> being picked from the queue at the same time, because the scheduler
> determines that a queued task still needs to be sent to the executor when an
> earlier run has already sent it.
>
> In order to fix this, Pool handling needs to be centralized (code-wise) and
> work with a mutex (with_for_update()) on the database records. The
> scheduler/taskinstance can then do something like:
>
>     slot = Pool.obtain_slot(pool_id)
>     Pool.release_slot(slot)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
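
The centralized pool handling proposed in the issue could look roughly like the sketch below. It is only an illustration, not Airflow's actual Pool model: the class layout, the `pool` table, `add_pool`, `PoolFull`, and the use of the pool id as the slot token are all assumptions made for the example. The issue suggests a row lock via SQLAlchemy's with_for_update(); to stay self-contained, this sketch uses the standard-library sqlite3 module, and since SQLite has no SELECT ... FOR UPDATE, it swaps in BEGIN IMMEDIATE, which takes the database write lock before the availability check so that check and update happen atomically.

```python
import sqlite3


class PoolFull(Exception):
    """Hypothetical error raised when a pool has no free slot."""


class Pool:
    """Hypothetical centralized pool handler (not Airflow's real Pool model)."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pool ("
            "id TEXT PRIMARY KEY, "
            "slots INTEGER NOT NULL, "
            "used INTEGER NOT NULL DEFAULT 0)"
        )

    def add_pool(self, pool_id, slots):
        # Helper for the example only: register a pool with a fixed capacity.
        self.conn.execute(
            "INSERT INTO pool (id, slots) VALUES (?, ?)", (pool_id, slots)
        )

    def obtain_slot(self, pool_id):
        # Stand-in for with_for_update(): take the write lock *before*
        # reading, so no other writer can change the row in between.
        self.conn.execute("BEGIN IMMEDIATE")
        try:
            row = self.conn.execute(
                "SELECT slots, used FROM pool WHERE id = ?", (pool_id,)
            ).fetchone()
            if row is None:
                raise KeyError(pool_id)
            slots, used = row
            if used >= slots:
                raise PoolFull(pool_id)
            self.conn.execute(
                "UPDATE pool SET used = used + 1 WHERE id = ?", (pool_id,)
            )
            self.conn.execute("COMMIT")
        except Exception:
            self.conn.execute("ROLLBACK")
            raise
        # Assumption: the slot token is simply the pool id.
        return pool_id

    def release_slot(self, slot):
        self.conn.execute("BEGIN IMMEDIATE")
        try:
            # The "used > 0" guard keeps a double release from going negative.
            self.conn.execute(
                "UPDATE pool SET used = used - 1 WHERE id = ? AND used > 0",
                (slot,),
            )
            self.conn.execute("COMMIT")
        except Exception:
            self.conn.execute("ROLLBACK")
            raise


# isolation_level=None disables implicit transactions, so the explicit
# BEGIN IMMEDIATE / COMMIT statements above are the only ones in play.
conn = sqlite3.connect(":memory:", isolation_level=None)
pools = Pool(conn)
pools.add_pool("default", 1)

slot = pools.obtain_slot("default")
try:
    pools.obtain_slot("default")
    oversubscribed = True
except PoolFull:
    oversubscribed = False

pools.release_slot(slot)
slot_again = pools.obtain_slot("default")
```

Because the availability check and the counter update live in one place and run under one lock, the scheduler, taskinstance and executor would no longer each need their own slot check. With a server database, the same shape would presumably use a row-level lock on the pool record instead of a database-wide write lock.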