Thank you very much, Jarek, for the detailed guidance; it was extremely helpful.
I fully agree that preloading Airflow internal modules should be considered later, and I also think your points about external libraries are very reasonable and worth exploring. When I first read Ash’s earlier comment (https://github.com/apache/airflow/discussions/58143#discussioncomment-14926133), I didn’t fully grasp the implications. After reading your response, it became much clearer what additional aspects need to be considered.
As you pointed out, the most critical question seems to be how to handle modules that are expected to be reloaded periodically, which in most cases are user-authored modules. Simply excluding packages located under the dags or plugins directories does not seem sufficient, because users structure and deploy their code in many different ways. For example, our team periodically rsyncs a Python package developed in a legacy system from another server and adds it to PYTHONPATH. Handling such diverse cases with a fixed rule set would be very difficult.
Initially, my proposal was to provide an option for users to explicitly specify which libraries should be pre-imported. However, after considering your suggestion, I now think it might be more effective to invert the approach: preload all imported modules by default, and allow users to explicitly specify which modules should not be pre-imported (i.e. modules that should instead be loaded in the forked processes). With this approach, the vast majority of users, including “light users” who may not have deep knowledge of Airflow internals or memory optimization, would automatically benefit from the performance improvements.
That said, we would still need to think carefully about how to clearly document and guide users in configuring these exceptions for the edge cases mentioned above. This is, of course, an area that would benefit from further discussion.
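To make the inverted approach a bit more concrete, here is a minimal sketch of "preload by default, exclude on request" combined with gc.freeze. It is only an illustration under stated assumptions: `PRELOAD_EXCLUDE`, `should_preload`, and `preload_and_freeze` are hypothetical names and not existing Airflow settings or APIs, and the list of candidate modules is assumed to come from somewhere else (for example, from modules observed during a previous parse).

```python
import gc
import importlib

# Hypothetical user-configured exclusions; NOT an existing Airflow option.
# Anything listed here is left for the forked process to import itself.
PRELOAD_EXCLUDE = ("legacy_pkg", "my_company.dags")


def should_preload(module_name, excluded=PRELOAD_EXCLUDE):
    """Return True unless the module is an excluded package or lives inside one."""
    return not any(
        module_name == prefix or module_name.startswith(prefix + ".")
        for prefix in excluded
    )


def preload_and_freeze(candidates, excluded=PRELOAD_EXCLUDE):
    """Import the allowed modules in the parent process, then freeze the heap.

    Intended to be called once in the parent, before any child is forked.
    """
    loaded = []
    for name in candidates:
        if not should_preload(name, excluded):
            continue  # the child imports this one, so it can still be reloaded
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # an unimportable candidate must not break the parent
        loaded.append(name)
    gc.collect()  # collect garbage first so it is not frozen along with the rest
    gc.freeze()   # move all tracked objects into the permanent generation
    return loaded
```

Processes forked after `preload_and_freeze(...)` would inherit the preloaded modules, while anything matching the exclusion list (such as an rsynced legacy package) would still be imported fresh in each child. How the exclusion list would actually be configured and documented is exactly the open question above.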
I think this part can be revisited in a more concrete form after we apply gc.freeze to the DAG processor and, if necessary, Celery, and after we have run real performance tests. At that point, I expect to come back with a more concrete proposal, likely in the form of an AIP.
I should also note that I do not yet have a deep understanding of Celery’s internal execution model, so applying these ideas there will require careful validation and extensive testing. I agree that this should be approached incrementally.
Once again, thank you very much for the insights, Jarek; they were greatly appreciated. Thank you as well, Jens and Aritra, for the interest and thoughtful feedback.
Best regards,
Jeongwoo Do
