coufon commented on issue #5594: [AIRFLOW-4924] Loading DAGs asynchronously in 
Airflow webserver
URL: https://github.com/apache/airflow/pull/5594#issuecomment-511909913
 
 
   Hi Jarek, thanks for your comments. Here are my thoughts:
   
   > a starting point to implement part of DAG persistence
   
   We are working on storing a 'stringified DAG' in the DB, to be used by the 
webserver and scheduler. We found that this is now straightforward because a 
'stringified DAG' is always picklable. I will send out an AIP soon. This change 
(which still uses the current DAG classes) is not as fundamental as:
   
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB
   
   > maybe you can share your experiences with an actual "production" usage of 
this?
   
   We implemented async_dag_loader in Composer because we see more and more 
users running a large number of DAGs in one Airflow cluster. The webserver 
frequently goes down because 'collecting DAG time' > 'webserver gunicorn worker 
refreshing time'. We therefore suggest this feature for all users running >= 
1,000 DAGs.
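
   As a rough illustration of the idea (not the actual async_dag_loader code), 
the sketch below keeps serving the last finished DAG snapshot while a slow 
collection runs in the background. The class name `AsyncDagLoader`, its methods, 
and `slow_collect` are hypothetical, and a thread is used only to keep the 
example short; the real loader collects DAGs in separate processes.

```python
import threading


class AsyncDagLoader:
    """Hypothetical sketch: refresh a DAG snapshot without blocking readers."""

    def __init__(self, collect_fn):
        self._collect_fn = collect_fn
        self._lock = threading.Lock()
        self._dags = {}        # last completed snapshot
        self._worker = None

    def refresh(self):
        """Start a background collection unless one is already running."""
        if self._worker is not None and self._worker.is_alive():
            return
        self._worker = threading.Thread(target=self._collect, daemon=True)
        self._worker.start()

    def _collect(self):
        dags = self._collect_fn()   # slow step: parses every DAG file
        with self._lock:
            self._dags = dags       # swap in the new snapshot

    def get_dags(self):
        """Return a copy of the latest finished snapshot immediately."""
        with self._lock:
            return dict(self._dags)

    def wait(self):
        """Block until the current collection finishes."""
        if self._worker is not None:
            self._worker.join()


# The event stands in for the time spent parsing thousands of DAG files.
release = threading.Event()


def slow_collect():
    release.wait()
    return {"example_skip_dag": "<DAG: example_skip_dag>"}


loader = AsyncDagLoader(slow_collect)
loader.refresh()
before = loader.get_dags()   # returns at once while collection is still running
release.set()                # let the collection finish
loader.wait()
after = loader.get_dags()    # now contains the collected DAGs
```

   The point of the design is that a request never waits on DAG collection, so 
a long collection cannot outlast the gunicorn worker refresh interval.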
   
   Even with async_dag_loader, there is still a memory issue. Composer runs the 
Airflow webserver on a separate VM. Collecting thousands of DAGs is 
memory-intensive (although the DAG objects themselves do not consume much 
memory), so users may still see the webserver go down due to OOM. We therefore 
suggest that users set [webserver] workers=1. We are currently working on 
storing 'stringified DAGs' in the DB.
   
   > casting to BaseOperator for non-airflow modules
   
   Classes defined in non-airflow modules may fail to unpickle. These modules 
are imported in the 'DAG collecting' processes, but not in the webserver main 
process, so unpickling them there leads to 'module not found' errors. To import 
these modules in the webserver main process, we would have to process the DAG 
files, which takes us back to sync DAG loading.
   
   Here is an example: 
https://github.com/apache/airflow/blob/master/airflow/example_dags/example_skip_dag.py
   
   In this Airflow test DAG, there is a non-airflow operator "class 
DummySkipOperator" (not defined in airflow/operators or 
airflow/contrib/operators). The DAG containing that operator cannot be 
unpickled unless we replace the operator with BaseOperator.
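
   The failure can be reproduced without Airflow. The sketch below is only an 
illustration under stated assumptions: `FakeBaseOperator`, `register`, 
`to_base_operator`, and the module names `fake_airflow_models` and 
`fake_user_dag_file` are all hypothetical stand-ins, and deleting the entry from 
`sys.modules` simulates a webserver process that never imported the DAG file.

```python
import pickle
import sys
import types


class FakeBaseOperator:
    """Stand-in for Airflow's BaseOperator (hypothetical, not the real class)."""

    def __init__(self, task_id):
        self.task_id = task_id


class DummySkipOperator(FakeBaseOperator):
    """Stand-in for an operator defined inside a user's DAG file."""


def register(module_name, cls):
    """Pretend `cls` was defined in an importable module named `module_name`."""
    module = sys.modules.setdefault(module_name, types.ModuleType(module_name))
    cls.__module__ = module_name
    cls.__qualname__ = cls.__name__
    setattr(module, cls.__name__, cls)


# The base class lives in a module that every process can import.
register("fake_airflow_models", FakeBaseOperator)
# The custom operator lives in a DAG file that only the collecting process imports.
register("fake_user_dag_file", DummySkipOperator)


def to_base_operator(op):
    """Copy the fields the webserver needs onto a plain base operator."""
    return FakeBaseOperator(task_id=op.task_id)


def collect():
    """'DAG collecting' side: pickle the operator as-is and after casting."""
    op = DummySkipOperator(task_id="always_skip")
    return pickle.dumps(op), pickle.dumps(to_base_operator(op))


def load_in_webserver(payload):
    """Webserver side: return the unpickled object, or the import error."""
    try:
        return pickle.loads(payload)
    except ImportError as err:  # ModuleNotFoundError is a subclass of ImportError
        return err


raw_payload, cast_payload = collect()
# Simulate the webserver main process, which never imported the DAG file.
del sys.modules["fake_user_dag_file"]

raw_result = load_in_webserver(raw_payload)    # fails: module cannot be found
cast_result = load_in_webserver(cast_payload)  # succeeds: only the base class is needed
```

   Pickle stores a class by module and name rather than by value, which is why 
the cast copy loads while the original operator does not.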
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
