Thank you very much, Jarek, for the detailed guidance; it was extremely helpful.

I fully agree that preloading Airflow internal modules should be considered 
later, and I also think your points about external libraries are very 
reasonable and worth exploring.

When I first read Ash’s earlier comment 
(https://github.com/apache/airflow/discussions/58143#discussioncomment-14926133),
 I didn’t fully grasp the implications. However, after reading your response, 
it became much clearer what additional aspects need to be considered. As you 
pointed out, the most critical question seems to be how to handle modules that 
are expected to be reloaded periodically, which in most cases are user-authored 
modules.

Simply excluding packages located under the dags or plugins directories does 
not seem sufficient, as there are many different ways users structure and 
deploy their code. As an example, in our team we periodically rsync a Python 
package developed in a legacy system from another server and add it to 
PYTHONPATH. Handling such diverse cases with a fixed rule set would be very 
difficult.

Initially, my proposal was to provide an option for users to explicitly specify 
which libraries should be pre-imported. However, after considering your 
suggestion, I now think it might be more effective to invert the approach:
preload all imported modules by default, and allow users to explicitly specify 
which modules should not be pre-imported (i.e. modules that should be loaded in 
the forked processes).
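
To illustrate what I mean, here is a rough sketch (not working Airflow code; 
the "preload_excluded_modules" option name is purely hypothetical and does not 
exist today, and the module names are placeholders):

    import importlib

    def preload_modules(candidate_modules, excluded_prefixes):
        """Import every candidate module in the parent process, except those
        the user has explicitly excluded; excluded modules stay lazy and are
        imported inside the forked processes, as they are today."""
        for name in candidate_modules:
            if any(name == p or name.startswith(p + ".")
                   for p in excluded_prefixes):
                continue
            try:
                importlib.import_module(name)
            except ImportError:
                # If a module cannot be imported here, behaviour falls back
                # to the current one: the forked process imports it itself.
                pass

    # excluded_prefixes could come from a (hypothetical) config option such as
    # [dag_processor] preload_excluded_modules = my_legacy_pkg
    preload_modules(["numpy", "pandas", "my_legacy_pkg.utils"],
                    excluded_prefixes=["my_legacy_pkg"])

The exact mechanism (config option, environment variable, or something else) 
is of course open for discussion.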

With this approach, the vast majority of users — including “light users” who 
may not have deep knowledge of Airflow internals or memory optimization — would 
automatically benefit from the performance improvements. That said, we would 
still need to think carefully about how to clearly document and guide users in 
configuring these exceptions for the edge cases mentioned above. This is, of 
course, an area that would benefit from further discussion.

I think this part can be revisited once we have applied gc.freeze to the DAG 
processor and to Celery (if necessary), and once we have run real performance 
tests. At that point, I expect to come back with a more concrete proposal, 
likely in the form of an AIP.
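
For reference, the gc.freeze pattern I have in mind for the parent process 
looks roughly like this (a minimal sketch using only the standard library, not 
the actual DAG processor code):

    import gc
    import os

    # In the parent process, after the heavy modules have been preloaded:
    gc.disable()   # avoid a collection between freeze() and fork()
    gc.collect()   # drop short-lived garbage so only long-lived objects remain
    gc.freeze()    # move all tracked objects to the permanent generation;
                   # later collections will not touch (and thus not dirty) them

    pid = os.fork()
    if pid == 0:
        # Child process: re-enable the GC for the objects it creates itself.
        # The frozen objects stay shared with the parent via copy-on-write.
        gc.enable()
        # ... parse the DAG file / run the task here ...
        os._exit(0)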

Once again, thank you very much for the insights, Jarek — they were greatly 
appreciated.

I should also note that I do not yet have a deep understanding of Celery’s 
internal execution model, so applying these ideas there will require careful 
validation and extensive testing. I agree that this should be approached 
incrementally.

Thank you as well, Jens and Aritra, for the interest and thoughtful feedback.

Best regards,
Jeongwoo Do

