wjddn279 commented on PR #65943: URL: https://github.com/apache/airflow/pull/65943#issuecomment-4442215679
@diogosilva30 Thanks for reply! > On the "gradually decreases" question, I see where you're coming from, but the problem here isn't plugin imports in user code. The warning fires on every single fork throughout the worker's lifetime because supervisor.py itself keeps a thread pool alive for async log pushing (aiofiles/anyio). Add GCS credential refreshes and Secret Manager calls on every task completion and you've got live threads on basically every fork. The race doesn't go away after warmup. Fair point. I was only thinking about the import-lock case, but it sounds like the locking issues span much more broadly across threads than just imports. > What about multiprocessing.forkserver? After looking into how forkserver actually works, **it looks like a better fit than a fork pool for a multi-threaded environment like the edge worker**. There is no need to write many code to apply it. Spinning up the server before the async cycle starts and forking from there seems like the right shape. > Leaks, stale connections, open fds all accumulate across task runs. Right now every task gets a clean process and that isolation earns its keep. The forked process doesn't do anything beyond running supervise_task. Any connections or fds it creates are explicitly closed. (There were a few leaks in practice, but I've fixed most of them by now.) > Plugin hot reload via git-sync breaks until a worker recycles since long-lived workers hold onto their imported code. Same here — by design the worker doesn't import user code directly. All user code is loaded only inside the short-lived process forked by supervise_task. > It touches the core execution model (dispatcher, queueing, lifecycle, supervision). Way bigger blast radius than this PR warrants. Agreed. There's reference code we could lean on, but the diff is still substantial — you'd need a Queue for IPC and liveness checks layered on top. > If a pool worker dies mid-task, you have to detect it, restart it, and reconcile with the Edge API. Per-task processes just... handle that. This was actually my biggest concern with the pool approach. If a worker dies you have to restart it, but there's no guarantee the restarted process is lock-safe either. **That's exactly where forkserver looks like the strong option** -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
