[GitHub] [airflow] coufon commented on issue #5594: [AIRFLOW-4924] Loading DAGs asynchronously in Airflow webserver

GitBox Tue, 16 Jul 2019 10:02:38 -0700

coufon commented on issue #5594: [AIRFLOW-4924] Loading DAGs asynchronously in 
Airflow webserver
URL: https://github.com/apache/airflow/pull/5594#issuecomment-511901734
 
 
   Hi Sumit, thanks for your help in reviewing. To answer your questions:
   
   > * So each gunicorn worker going to launch a separate thread to parse DAGs 
asynchronously, and not a single background process going to do the DAG parsing 
for all the workers.
   
   Yes. Each gunicorn worker now has its own separate process to collect DAGs. A 
single background process shared by all workers is possible, but it would need 
more code changes in airflow/www/views.py. We can discuss whether it is needed.
   
   One drawback of using a process for each gunicorn worker is that the 
'collect dags' process consumes most of the memory. Users usually turn on 
async_dag_loader when they have many DAGs (>1,000), so the webserver is already 
very memory-intensive. We therefore also suggest setting [webserver] workers = 1.
   
   > * The pickled DAG object won't be stored into DB and just used for thread 
communication.
   
   Correct. They are not stored in the DB; for now they are kept only in memory. 
We are working on storing them in the DB (just like DagPickle in Airflow, but we 
store stringified DAGs instead), so that the UI no longer needs to run DAG code. 
The scheduler could also use stringified DAGs to speed up scheduling actions 
(to be tested out).
   
   > * As of now each gunicorn worker might have diff copy of DAGs (due to 
changes) and cause confusion on UI, something which isn't going to solve with 
this change.
   
   Yes. This PR does not address this inconsistency problem. The solution can be 
a future PR that either:
   (1) as you mentioned, uses a single DAG-collecting process for all gunicorn 
workers; or
   (2) has the webserver read cached DAGs from the DB.
   
   > * How it going to behave if DAG parsing time for each thread is more than 
the configured `collect_dags_interval` time.
   
   If the process takes longer than collect_dags_interval, it will immediately 
start the next round of collection. There is no sleep between two rounds of 
collection.
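   
   As a rough illustration of that timing rule (a sketch only, not the code in 
this PR): the names `next_sleep` and `run_collection_loop` are hypothetical, and 
it assumes the interval is measured from the start of each round.

```python
import time


def next_sleep(elapsed, interval):
    """Seconds to wait before the next round; zero if the round overran."""
    return max(0.0, interval - elapsed)


def run_collection_loop(collect_dags, interval, rounds,
                        clock=time.monotonic, sleep=time.sleep):
    """Run `collect_dags` `rounds` times, at most once per `interval` seconds.

    Hypothetical helper: `collect_dags` stands in for whatever parses the
    DAG files. `clock` and `sleep` are injectable so the loop can be tested.
    """
    for _ in range(rounds):
        started = clock()
        collect_dags()
        wait = next_sleep(clock() - started, interval)
        if wait > 0:
            # The round finished early: wait out the rest of the interval.
            sleep(wait)
        # Otherwise the round overran, so the next one starts immediately.
```

   With an interval of 30 seconds, a round that takes 10 seconds is followed by 
a 20-second sleep, while a round that takes 40 seconds is followed by none.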


