Hi guys, I'd like to propose a few improvements that would help Airflow scale:

Scheduler:

1. - Problem: The scheduler loop becomes slow when the number of running tasks grows too large, which slows down the DAG parsing/scheduler loop and creates scheduling delay, AIRFLOW-2156 <https://issues.apache.org/jira/browse/AIRFLOW-2156>
   - Proposal: Parallelize Celery querying.
   - Progress: Dan Davydov (@aoen) has made a change to parallelize Celery querying, and we have been running it in production for over a month. It solved the scheduling delay we saw in production with ~15k running tasks at peak, and our stress-testing cluster has shown it can handle ~30k running tasks. We see a 10x+ performance improvement on Celery querying with 16 subprocesses querying Celery; the number of subprocesses is configurable.

2. - Problem: The DAG parsing loop is coupled with the scheduler loop, which makes DAG parsing a bottleneck and creates scheduling delay, AIRFLOW-2760 <https://issues.apache.org/jira/browse/AIRFLOW-2760>
   - Proposal: Decouple the DAG parsing loop from the scheduler loop.
   - Progress: A prototype works locally.

3. - Problem: The scheduler loop becomes slow when the number of tasks that need to be queued grows too large, which slows down the DAG parsing/scheduler loop and creates scheduling delay, AIRFLOW-2761 <https://issues.apache.org/jira/browse/AIRFLOW-2761>
   - Proposal: Parallelize Celery enqueuing.
   - Progress: Not started yet. Planned for Q3.

Webserver:

1. - Problem: The webserver parses the DagBag twice during startup, which makes startup slow when there are many DAG files, AIRFLOW-2615 <https://issues.apache.org/jira/browse/AIRFLOW-2615>
   - Proposal: Remove the redundant DagBag parsing.
   - Progress: A first attempt <https://github.com/apache/incubator-airflow/pull/3506> did not succeed. Planned for Q3.

2. - Problem: The webserver parses the DagBag in a single-threaded fashion, which makes startup slow when there are many DAG files, AIRFLOW-2762 <https://issues.apache.org/jira/browse/AIRFLOW-2762>
   - Proposal: Parallelize DagBag parsing in the webserver.
     Because not all DAGs are picklable, the webserver would lose access to the actual DAG object, but only the workers should need the actual DAG object.
   - Progress: Not started yet. Planned for Q3.

Feedback is hugely appreciated.

Cheers,
Kevin Y
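
P.S. To make the idea behind Scheduler item 1 a bit more concrete, here is a rough sketch of splitting a set of task ids into chunks and querying their states concurrently. This is only an illustration, not the actual change: `chunk`, `fetch_states`, `fetch_states_parallel` and the `lookup` callable are hypothetical names, `lookup` stands in for whatever call asks the result backend for one task's state, and the sketch uses a thread pool to stay self-contained, whereas the change described above uses subprocesses.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, num_chunks):
    """Split items into at most num_chunks contiguous, roughly equal chunks."""
    if not items:
        return []
    size = -(-len(items) // num_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_states(task_ids, lookup):
    """Hypothetical stand-in for querying the backend for one chunk of tasks."""
    return {task_id: lookup(task_id) for task_id in task_ids}


def fetch_states_parallel(task_ids, lookup, parallelism=16):
    """Query task states chunk by chunk with a pool of concurrent workers.

    parallelism plays the role of the configurable number of query workers.
    """
    chunks = chunk(list(task_ids), parallelism)
    states = {}
    if not chunks:
        return states
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # Each worker handles one chunk; partial results are merged here.
        for partial in pool.map(lambda c: fetch_states(c, lookup), chunks):
            states.update(partial)
    return states
```

Something like `fetch_states_parallel(running_task_ids, lookup, parallelism=16)` would then replace a sequential loop over all running tasks, where `running_task_ids` is again just a placeholder name.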