abhipalsingh commented on issue #68693:
URL: https://github.com/apache/airflow/issues/68693#issuecomment-4741633498
Thanks, that matches our experience exactly. We haven't eliminated the
leak (that needs the executor-side change), but here's the before/after
that bounded it for us, in case it helps anyone landing here:
Before (accumulating → OOM):
- gunicorn, [api] workers = 4, [api] worker_refresh_interval = 43200 (12 h
rolling refresh)
- Each refresh SIGTERMs workers → their KubernetesExecutor Manager
children reparent to PID 1 and are never reaped → orphaned
serve_forever processes accumulate across cycles until the pod OOMs (8 Gi).
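For anyone wanting to check whether their pod shows the same pattern, here is a minimal sketch (not from Airflow) that lists processes whose parent is PID 1 by reading /proc on Linux. The function names are hypothetical. Legitimate direct children of PID 1 show up too, so the useful signal is the count growing after each refresh, not the list itself.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, right = stat_text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    fields = right.split()
    # fields[0] is the process state, fields[1] is the parent PID.
    return int(pid_str), comm, int(fields[1])


def children_of_pid1(proc_root="/proc"):
    """List (pid, comm) for every process whose parent is PID 1."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, ppid = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited mid-scan, or the entry is unreadable
        if ppid == 1:
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in children_of_pid1():
        print(pid, comm)
```

Running it inside the pod before and after a refresh cycle and comparing the two lists should show whether reparented processes are piling up.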
After (bounded, stable):
- [api] worker_refresh_interval = 0 (disable the periodic worker recycle)
- With long-lived workers the per-worker Manager stays attached and is
reused (the executor is cached per process), so the Manager count is
capped at ~1 per worker (at most [api] workers per pod) instead of
growing. Memory went flat at ~⅓ of the limit.
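To illustrate why a long-lived worker reuses one Manager while a recycled worker ends up with a new one, here is a toy sketch of a per-process cache. It is not Airflow's code: StandInManager, get_manager and start_fresh_worker are hypothetical names, and clearing the cache only models a new worker process starting with nothing cached.

```python
import functools
import itertools

_serials = itertools.count(1)


class StandInManager:
    """Hypothetical placeholder for the per-process Manager; not Airflow code."""

    def __init__(self):
        # Each construction models spawning one serve_forever helper process.
        self.serial = next(_serials)


@functools.lru_cache(maxsize=None)
def get_manager():
    # The body runs once per process; later calls return the cached object.
    return StandInManager()


def start_fresh_worker():
    # Assumption: a recycled worker is a new process with an empty cache.
    # The old Manager is not shut down here, which models the orphan.
    get_manager.cache_clear()


first = get_manager()
second = get_manager()   # same worker: the cached Manager is reused
start_fresh_worker()
third = get_manager()    # new worker: a second Manager now exists
print(first is second, first is third)  # prints: True False
```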
Caveats: this only caps it. Each worker still holds one idle
serve_forever Manager for the pod's lifetime (cleared on pod
restart/deploy). Raising workers raises that floor (1 per worker), and
any worker recycling (a nonzero worker_refresh_interval or gunicorn
max_requests) re-introduces the orphan accumulation. So it's a
mitigation, not a fix: the real fix is not creating/leaking the Manager
on the get_task_log path.
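As a rough way to reason about the two regimes, here is a back-of-the-envelope sketch of the worst case. It assumes every worker creates exactly one Manager during its lifetime and that each refresh replaces all workers. The function name is hypothetical, and the result is an upper bound, not a measurement.

```python
def max_manager_processes(workers, uptime_s, refresh_interval_s):
    """Upper bound on Manager processes (live plus orphaned) in one pod."""
    if refresh_interval_s <= 0:
        # Refresh disabled: at most one long-lived Manager per worker.
        return workers
    # Each completed refresh can orphan one Manager per worker, and the
    # current generation of workers can hold one each on top of that.
    completed_refreshes = uptime_s // refresh_interval_s
    return workers * (completed_refreshes + 1)


WEEK_S = 7 * 24 * 3600
print(max_manager_processes(4, WEEK_S, 43200))  # 12 h refresh -> 60
print(max_manager_processes(4, WEEK_S, 0))      # refresh disabled -> 4
```

With 4 workers and a 12 h refresh, a week of uptime allows up to 60 Manager processes; with the refresh disabled the bound stays at 4, and it scales linearly if workers is raised.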