TL;DR: We can improve DAG parsing time by pre-importing Airflow dependencies in the main process before forking, avoiding repeated import overhead in every child process.
--

Looking at DAG parsing times recently, I noticed that a substantial amount of time is spent processing imports. The Good Practices guide [1] addresses top-level imports consuming time and generating overhead, but there is one category of imports that cannot easily be moved to a local scope: the Airflow modules themselves. For instance, just `from airflow.decorators import dag, task` can take tens of milliseconds every time a DAG is parsed.

Importing those dependencies is slow because they are not in the module cache (`sys.modules`), so the whole import tree has to be walked and evaluated. And since the module cache is local to the process, and we spawn a new process for each DAG file, there is effectively no caching happening.

To mitigate this, I propose that we pre-import modules in the main process before it is forked. Since forking copies the parent's memory, the child processes would find those modules already in the cache. We cannot import every module, but pre-importing commonly used Airflow modules could significantly improve parsing times. It is hard to provide numbers without "reference DAGs" to evaluate against, but just importing `from airflow.decorators import dag, task` before forking showed improvements to DAG processing time in the 10-20% range.

To further improve DAG parsing efficiency, we could analyze the imports in the DAG files themselves, extract the ones that are Airflow modules, and import all of them using importlib. In my testing, this can make DAG parsing upwards of 60% faster. Rough sketches of both ideas are appended at the end of this message.

I opened a draft PR if you would like to see what it'd look like in the code: https://github.com/apache/airflow/pull/30495

I would like to hear feedback and thoughts on this. Are there any potential drawbacks that I missed?

[1] https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
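For illustration, here is a minimal sketch of the first idea: warming up `sys.modules` in the parent process so that forked children inherit it via copy-on-write memory. This is not the code from the draft PR; the module list, helper names, and the bare `os.fork()` loop are simplified stand-ins for what the DAG file processor actually does:

```python
import importlib
import os

# Hypothetical list of modules to warm up; the real set would be configurable.
MODULES_TO_PRELOAD = [
    "airflow.decorators",
    "airflow.models",
    "airflow.operators.bash",
]


def preload_airflow_modules():
    for name in MODULES_TO_PRELOAD:
        try:
            # Populates sys.modules in the parent process.
            importlib.import_module(name)
        except ImportError:
            # A missing optional module should not break the processor.
            pass


def parse_dag_file(path):
    # Placeholder for the real per-file parsing logic.
    print(f"parsing {path} in pid {os.getpid()}")


def process_file(path):
    pid = os.fork()
    if pid == 0:
        # Child: the Airflow modules are already in sys.modules, so importing
        # them inside the DAG file is a cheap dictionary lookup.
        parse_dag_file(path)
        os._exit(0)
    os.waitpid(pid, 0)


if __name__ == "__main__":
    preload_airflow_modules()  # done once, before any fork
    for dag_file in ["dags/example_1.py", "dags/example_2.py"]:
        process_file(dag_file)
```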

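And a sketch of the second idea: scanning a DAG file with the standard `ast` module to find its `airflow` imports, then pre-importing them with `importlib`. Again, the helper names here are illustrative and not taken from the PR:

```python
import ast
import importlib


def collect_airflow_imports(dag_file_path):
    """Return the set of airflow.* module names imported in the file."""
    with open(dag_file_path, "r", encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=dag_file_path)

    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.startswith("airflow"):
                    modules.add(alias.name)
        elif isinstance(node, ast.ImportFrom):
            # node.module can be None for relative imports like `from . import x`.
            if node.module and node.level == 0 and node.module.startswith("airflow"):
                modules.add(node.module)
    return modules


def preimport(modules):
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            # Best effort: let the child process surface the real error
            # when it actually parses the DAG file.
            pass


# Usage: in the parent process, before forking a worker for each file.
# for path in dag_file_paths:
#     preimport(collect_airflow_imports(path))
```

The pre-import step is deliberately best-effort: any import that fails here is simply skipped, and the child process reports the real error when it parses the file, as it does today.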