rehan243 commented on issue #68693: URL: https://github.com/apache/airflow/issues/68693#issuecomment-4740381643
Oh yikes, yeah, we've seen similar issues with `multiprocessing.Manager` hanging around longer than it should. A `Manager` starts a separate server process, and that `serve_forever` process is a classic leak culprit when the parent doesn't explicitly shut it down. I also ran into the part about gunicorn worker recycling making it worse: each recycled worker can leave its manager process behind. If you've got multiple pods scaling up and down, those orphaned processes can pile up quickly.

One thing we ended up doing in a related setup was swapping out the `Manager` for something lighter, like a plain `Queue` or a `ThreadPoolExecutor` for transient tasks, since those don't leave a persistent process behind. Not sure if that's doable here given the executor design, but it might be worth looking into.

Curious: what does your API server's worker setup look like? Are you using the default gunicorn config, or tweaking worker counts/timeouts? Wondering if scaling differently could mitigate the OOMs a bit while you're debugging this.
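To make the two options concrete, here is a rough sketch. It is not Airflow's code: `run_with_manager` and `run_with_threads` are made-up names and the doubling "work" is a placeholder. The first function uses the `Manager` as a context manager so its server process is shut down on exit. The second avoids the extra process entirely by using threads.

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor


def run_with_manager(items):
    """Hypothetical helper: scope the Manager so its server process is stopped.

    Leaving the `with` block shuts the manager down, which ends the
    server process instead of leaving it behind when the caller returns.
    """
    with multiprocessing.Manager() as manager:
        shared = manager.list()
        for item in items:
            shared.append(item * 2)  # placeholder for real work
        # Copy the results out before the manager process goes away.
        return list(shared)


def run_with_threads(items):
    """Hypothetical helper: same work with threads, no extra process at all."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The pool's threads are joined when the `with` block exits.
        return list(pool.map(lambda item: item * 2, items))


if __name__ == "__main__":
    print(run_with_manager([1, 2, 3]))
    print(run_with_threads([1, 2, 3]))
```

The thread version only fits if the shared state doesn't need to cross process boundaries, which I'm assuming here and which may not hold for the executor in question.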
