abhipalsingh commented on PR #68708: URL: https://github.com/apache/airflow/pull/68708#issuecomment-4741775521
Adding a data point from the API server side, since this fork behavior isn't only a scheduler concern.

On Airflow 3.2.1 with apache-airflow-providers-openlineage==2.14.0 (pre-#65677), manual task-instance state changes via the REST API (mark success/failed/skipped, clear) fire on_task_instance_* on the api-server, which is a multithreaded async process (FastAPI/uvicorn under gunicorn). The listener's _fork_execute calls os.fork() from that multithreaded worker. A fraction of the children deadlock immediately on an inherited lock (py-spy showed them parked in futex_wait_queue, never reaching the post-fork setproctitle) and are never reaped. Each stuck child holds ~350–400 MB, so they accumulate without bound until the api-server hits OOM.

#65677 helps the manual-state-change path (it routes it through the ProcessPoolExecutor instead of a raw fork), but:

(a) that still forks a pool from the multithreaded async server, and
(b) the natural-lifecycle handlers still use use_fork=True.

So thread-based emission (this issue) is the cleaner fit for async contexts like the api-server, where os.fork() is fundamentally unsafe. We worked around it by disabling OpenLineage on the api-server (no transport is configured there anyway), but big +1 for execute_in_thread as the general fix.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
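
For readers following the thread, here is a minimal sketch of what thread-based emission looks like in general, as opposed to forking from a multithreaded server. It is not the OpenLineage provider's code and does not show how execute_in_thread is implemented: submit_emission, emit_event, and _emit_pool are hypothetical names, and emit_event is a stand-in for a real transport call. The point is only that the work runs on a worker thread in the same process, so no child process inherits a lock held by another thread.

```python
import os
import threading
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical names throughout; this is an illustration, not provider code.
_emit_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="lineage-emit")


def emit_event(event: dict) -> tuple[int, str]:
    """Stand-in for a transport call.

    Returns the PID and thread name it ran on, to show that no child
    process was created.
    """
    return os.getpid(), threading.current_thread().name


def _log_failure(future: Future) -> None:
    """Report a failed emission without blocking the submitting thread."""
    if future.cancelled():
        return
    exc = future.exception()
    if exc is not None:
        print(f"lineage emission failed: {exc!r}")


def submit_emission(event: dict) -> Future:
    """Queue the emission on a worker thread instead of calling os.fork()."""
    future = _emit_pool.submit(emit_event, event)
    future.add_done_callback(_log_failure)
    return future


if __name__ == "__main__":
    fut = submit_emission({"state": "success", "task_id": "example_task"})
    pid, thread_name = fut.result(timeout=5)
    print(pid == os.getpid(), thread_name)
```

A request handler would call submit_emission and return immediately. The trade-off is that the emission shares the server process, so a slow transport ties up a worker thread instead of a separate process.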
