rehan243 commented on issue #68693:
URL: https://github.com/apache/airflow/issues/68693#issuecomment-4740381643

   Oh yikes, yeah, we've seen similar issues with `multiprocessing.Manager` 
hanging around longer than it should. That `serve_forever` process is a classic 
leak culprit when the parent doesn't explicitly shut it down. The part about the 
gunicorn workers recycling making it worse: we totally ran into that too. If 
you've got multiple pods scaling up and down, those orphaned processes can pile 
up quickly.
   
   One thing we ended up doing in a related setup was swapping out the 
`Manager` for something lighter, like using a plain `Queue` or 
`ThreadPoolExecutor` for transient tasks, since those don't leave behind 
persistent processes. Not sure if that's doable here given the executor design, 
but might be worth looking into.
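
   For what it's worth, here's a rough sketch of the shape I mean. It assumes 
the tasks are short and mostly I/O-bound, and it reads "plain `Queue`" as the 
in-process `queue.Queue`. `run_transient_tasks` and the squaring workload are 
made-up placeholders, not anything from Airflow:

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_transient_tasks(items, max_workers=4):
    """Hypothetical helper: run short tasks on threads, collect via a plain queue."""
    # In-process queue: unlike Manager().Queue(), no server process backs it.
    results = queue.Queue()

    def work(item):
        # Placeholder workload (assumption): square the input.
        results.put((item, item * item))

    # Leaving the with-block waits for all worker threads to finish,
    # so nothing outlives this call.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so any exception from work() is raised here.
        list(pool.map(work, items))

    collected = {}
    while not results.empty():
        item, value = results.get()
        collected[item] = value
    return collected
```

   The tradeoff is that threads share one interpreter, so this only fits if 
the work doesn't need real process isolation.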
   
   Curious: what does your API server's worker setup look like? Are you using 
the default gunicorn config or tweaking worker counts/timeouts? Wondering if 
scaling differently could mitigate the OOMs a bit while you're debugging this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
