I think it's a good idea. I don't expect "huge" gains (and of course the bigger the DAGs, the lower the gains) - but we've been doing a lot of small optimisations here and there, and when they accumulate, the compound effect might be significant. And it does not increase the complexity of Airflow or cost us a lot of effort to implement.
The nice thing is that it not only saves CPU time for parsing but also some memory. When you import before forking to sub-processes, the memory used by such imports is shared until it gets modified (copy-on-write), and in the case of imports it is safe to assume that most of the memory they use is not modified after allocation. The more parsing processes you have, the bigger the gain here. The CPU savings might also bring the costs of running Airflow down a bit.

I think (see the comments in the PR) that rather than parsing the DAG files for their imports, we could import **all** airflow.* modules (or, if that takes too much time, preselect a subset of those that are most likely to be imported) - see the illustrative sketches below the quoted message. Maybe excluding providers, as they might take a lot of time and the gains might be invalidated by having to keep all integrations in memory. On the other hand, importing the "used" providers upfront might indeed save a lot of time for re-parsing, so some benchmarks to see how much we really save would be great to have.

I think a good start could be to **just** load our example_dags (for core Airflow imports) and system tests (for providers). They are a good source of the variety of constructs we use, and the system tests especially are a good "benchmark" for providers - since we have system tests for many of the bigger providers, it should be quite possible to measure gains there. Having such benchmarks would enable us to see what really makes sense, as for now the impact is more anecdotal.

I see no immediate dangers with this approach. The one problem - potentially - could be losing the ability to dynamically reconfigure Airflow, which we could avoid by not pre-importing the configuration - but the configuration is already loaded ahead of it anyway.

J.

On Thu, Apr 6, 2023 at 1:47 AM Vandon, Raphael <[email protected]> wrote:

> TL;DR: We can improve DAG parsing time by pre-importing airflow
> dependencies in the main process before forking, thus reducing overhead and
> saving a lot of time.
>
> --
> Looking at DAG parsing times recently, I noticed that a good amount of
> time is spent processing imports. While the Good Practices[1] guide
> addresses the issue of top-level imports consuming time and generating
> overhead, there is still one critical import that cannot be easily switched
> to a local import: importing Airflow dependencies. For instance, just `from
> airflow.decorators import dag, task` can take tens of ms every time a DAG
> is parsed.
>
> Importing those dependencies is slow because they are not in the modules
> cache, so the whole import tree has to be walked and evaluated. And since
> the modules cache is local to the process, and we spawn a new process for
> each dag, there is effectively no caching happening.
>
> To mitigate this issue, I propose that we pre-import modules in the main
> process before it’s forked. Since forking creates a copy of the memory,
> we’d be able to fetch those modules from the cache in child processes.
> While we cannot import every module, pre-importing commonly used Airflow
> modules could significantly improve parsing times.
>
> It is a bit hard to provide numbers without “reference dags” to evaluate
> against, but just importing `from airflow.decorators import dag, task`
> before forking showed improvements to dag processing time in the 10-20%
> range.
>
> To further enhance DAG parsing efficiency, we could analyze imports from
> the dag files, extract the ones that are airflow modules, and import all of
> them using importlib. In my testing, this can make dag parsing upwards of
> 60% faster.
>
> I opened a draft PR if you would like to see what it'd look like in the
> code: https://github.com/apache/airflow/pull/30495
>
> I would like to hear feedback and thoughts on this. Are there any
> potential drawbacks to this that I missed?
>
> [1]
> https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
>
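
To make the "pre-import before forking" idea concrete, here is a minimal, hypothetical sketch - it is not the code from the draft PR, and the module list, function names and file paths are made up for illustration. The parent process imports a hand-picked set of airflow modules once; the forked workers then find them in sys.modules, and the memory backing those module objects is shared copy-on-write:

import importlib
import multiprocessing
import sys

# Hypothetical list of "commonly used" modules - which modules to include
# (core only, all of airflow.*, providers, ...) is exactly the open question
# discussed in this thread.
MODULES_TO_PREIMPORT = [
    "airflow.decorators",
    "airflow.models",
    "airflow.operators.bash",
]


def preimport_airflow_modules():
    # Import the selected modules once, in the main process, before any fork.
    for name in MODULES_TO_PREIMPORT:
        try:
            importlib.import_module(name)
        except ImportError:
            # A missing optional dependency should not break the processor.
            pass


def parse_dag_file(path):
    # Stand-in for the per-file DAG parsing done in a forked child process.
    # Here, `import airflow.decorators` is a cheap sys.modules lookup instead
    # of walking and evaluating the whole import tree again.
    assert "airflow.decorators" in sys.modules
    print(f"parsing {path} with {len(sys.modules)} modules already cached")


if __name__ == "__main__":
    preimport_airflow_modules()
    # "fork" (the default start method on Linux) keeps the parent's module
    # cache in the children, which is what makes the pre-import pay off.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=2) as pool:
        pool.map(parse_dag_file, ["dags/example_a.py", "dags/example_b.py"])

The interesting design question, as discussed above, is what goes into that list: core modules only, everything under airflow.*, or just the providers the DAG files actually use.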

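And a similarly hypothetical sketch of the second idea from the proposal - scanning the DAG files with the standard-library ast module, extracting the airflow.* imports and pre-importing them with importlib before forking. Again, the function names are invented and this is not the PR's implementation:

from __future__ import annotations

import ast
import importlib
from pathlib import Path


def airflow_imports_in_file(path: Path) -> set[str]:
    # Collect the airflow.* modules imported by one DAG file, without
    # executing it.
    tree = ast.parse(path.read_text())
    modules: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)
    return {m for m in modules if m == "airflow" or m.startswith("airflow.")}


def preimport_for_dag_folder(dag_folder: Path) -> None:
    # Import, in the parent process, every airflow.* module referenced by the
    # DAG files, so the forked parsers inherit them in sys.modules.
    for dag_file in dag_folder.rglob("*.py"):
        for module in airflow_imports_in_file(dag_file):
            try:
                importlib.import_module(module)
            except ImportError:
                # e.g. a provider that is not installed - skip it and let the
                # real parse of that file report the error.
                pass

Calling preimport_for_dag_folder(Path("dags")) in the parent before forking would then make those imports near-free in the children; a provider that is not installed is simply skipped, and the error surfaces when the file itself is parsed.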