wjddn279 commented on PR #65943:
URL: https://github.com/apache/airflow/pull/65943#issuecomment-4439175172
@diogosilva30
Interesting. I read through your analysis and it looks like a correct
explanation.
I have a question: **how often does this error occur? Does the user's code
change frequently?**
If I understand correctly, the edge worker runs multiple threads, and the
issue is that if one thread forks while another thread is in the middle of an
import, it can cause problems in the import system. If that's the case, the
problem can only arise while another thread is importing a module for the
first time.
Even with lazy loading, the edge worker has a fairly fixed footprint, so I'm
curious whether new modules actually get loaded often. A module that has
already been imported once should no longer be a source of the problem, so I
would expect the frequency to decrease gradually over time.
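The expectation above rests on Python caching imported modules in `sys.modules`, so a repeated import returns the cached module object instead of loading it again. A minimal sketch of that behaviour, using the standard-library module `colorsys` purely as a stand-in and a hypothetical helper `is_cached` (neither comes from the edge worker code):

```python
import importlib
import sys


def is_cached(module_name: str) -> bool:
    """Return True if the module is already present in sys.modules."""
    return module_name in sys.modules


# Whether this is True depends on what the process has loaded so far.
before = is_cached("colorsys")

# Loads the module if it was not cached yet; otherwise returns the cached one.
first = importlib.import_module("colorsys")
after = is_cached("colorsys")

# A second import is served from the cache and yields the same object.
second = importlib.import_module("colorsys")
```

Logging `is_cached(...)` for the modules a task touches would be one way to check how often genuinely new imports still happen in a long-running worker.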
Applying the same approach that exists in Celery seems like a good idea.
However, the trade-offs should be carefully understood. In Airflow, importing
the `airflow` module alone loads about 100 MB of libraries. The existing fork
approach significantly reduces PSS through copy-on-write (COW), whereas this
approach makes memory grow linearly with the number of concurrent executions.
Slower startup, because every process has to load everything again, is an
additional downside.
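To put rough numbers on the linear growth, here is a small sketch that plugs in the PSS figures from the measurement in this comment. The linear model for fresh interpreters is an assumption (each child is taken to cost the same as the measured ones), and the fork total is only computed for the 3 children that were actually measured, since per-child PSS under fork changes with the number of processes sharing the pages.

```python
# PSS figures in MiB, copied from the measurement in this comment.
SPAWN_PARENT_PSS = 98.3
SPAWN_CHILD_PSS = 97.8
FORK_PARENT_PSS = 30.3
FORK_CHILD_PSS = 12.7  # only valid for the measured case of 3 children


def spawn_total_pss(n_children: int) -> float:
    """Estimate total PSS when every child is a fresh interpreter.

    Assumes each child costs the same as the measured ones (linear model).
    """
    return round(SPAWN_PARENT_PSS + n_children * SPAWN_CHILD_PSS, 1)


def fork_total_pss_measured() -> float:
    """Total PSS for the measured fork case (parent plus 3 children)."""
    return round(FORK_PARENT_PSS + 3 * FORK_CHILD_PSS, 1)


print(spawn_total_pss(3))         # fresh interpreters, 3 children
print(fork_total_pss_measured())  # fork with COW, 3 children
```

For the measured case of 3 children this comes to 391.7 MiB versus 68.4 MiB, roughly a 5.7x difference.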
Below is the PSS usage measured when each child process does nothing but
`import airflow.sdk.execution_time.execute_workload`:
```
=== A. subprocess.Popen (fresh interpreter, no sharing) ===
parent pid=99  RSS=118.4 MiB  PSS=98.3 MiB
  pid   RSS MiB   PSS MiB   Private MiB
  100     117.9      97.8          95.8
  101     117.9      97.8          95.8
  102     117.9      97.8          95.8

=== B. multiprocessing.Process (fork, COW with parent) ===
parent pid=99  RSS=118.5 MiB  PSS=30.3 MiB
  pid   RSS MiB   PSS MiB   Private MiB
  110      99.7      12.7           4.0
  111      99.7      12.7           4.0
  112      99.7      12.7           4.0
```
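For anyone who wants to reproduce figures like these, here is one possible way to collect RSS, PSS and private memory on Linux by parsing `/proc/<pid>/smaps_rollup`. This is a sketch, not the script that produced the output above; the helper names `parse_smaps_rollup` and `read_memory` are made up for illustration.

```python
from pathlib import Path


def parse_smaps_rollup(text: str) -> dict[str, float]:
    """Parse /proc/<pid>/smaps_rollup content into RSS, PSS and private MiB."""
    kb: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Counter lines look like "Pss:   12345 kB"; this skips the first
        # line, which holds the address range and "[rollup]".
        if sep and len(fields) == 2 and fields[1] == "kB" and fields[0].isdigit():
            kb[key.strip()] = int(fields[0])
    private_kb = kb.get("Private_Clean", 0) + kb.get("Private_Dirty", 0)
    return {
        "rss_mib": kb.get("Rss", 0) / 1024,
        "pss_mib": kb.get("Pss", 0) / 1024,
        "private_mib": private_kb / 1024,
    }


def read_memory(pid: int) -> dict[str, float]:
    """Read the counters for a live process (Linux only)."""
    return parse_smaps_rollup(Path(f"/proc/{pid}/smaps_rollup").read_text())
```

Calling `read_memory(pid)` on the parent and on each child while they are still alive gives the three columns shown in the output above.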
As Jens mentioned, the problem is clear enough that I would have expected it
to be reported by now, so it's a bit curious that it hasn't been.