dabla commented on issue #59378: URL: https://github.com/apache/airflow/issues/59378#issuecomment-4056773277
> We hit this on our prod cluster (830 DAGs, ~25k tasks/day, 2 schedulers, KubernetesExecutor, ~500 peak concurrent workers on EKS) and spent 2+ weeks instrumenting it. Sharing what we found in case it helps. > > We traced it to at least 2 independent race conditions, both unguarded UPDATEs on task instance state that can overwrite a RUNNING task : > > 1. **`schedule_tis()` in `dagrun.py`** : a competing scheduler overwrites RUNNING → SCHEDULED. Worker heartbeats, gets 409 with `current_state: scheduled`, task killed. PR [Fix HA scheduler try_number double increment #60330](https://github.com/apache/airflow/pull/60330) by [@ephraimbuddy](https://github.com/ephraimbuddy) addresses this. > 2. **`ti_skip_downstream()` in the Execution API** : `BranchOperator` marks unchosen tasks as SKIPPED, but the UPDATE has no state guard, so it can overwrite tasks already RUNNING. Worker heartbeats, gets 409 with `current_state: skipped`, task killed. We opened PR [Fix ti_skip_downstream overwriting RUNNING tasks to SKIPPED #63266](https://github.com/apache/airflow/pull/63266) for this one. > > Both seem to exist in every Airflow 3.x release with 2+ schedulers. The frequency likely scales with concurrency : more schedulers and more tasks = wider race window. > > **What worked for us :** > > We patched both code paths at Docker build time (patching the installed `.py` files so every process gets the fix). Our [`apply_patches.py`](https://gist.github.com/sam-dumont/4bdc214a7673f6571f4fa7058153a43c) script does this (auto-detects already-patched files, self-disables when upstream fix lands). Got us from 374 errors/day to 0. > > One thing we learned the hard way : monkey patches in `airflow_local_settings.py` worked for `schedule_tis()` (runs in the scheduler) but NOT for `ti_skip_downstream()`. That one runs on the API server (FastAPI/uvicorn), which has a different startup path and never imports `airflow_local_settings.py`. We had zero 409s for 18h, then `skipped` 409s exploded to 131/day when load increased. Build-time patching was the only approach that covered all processes. > > **Diagnosing which vector you're hitting :** > > Parsing the `current_state` field from the 409 response body was the key to telling them apart : > > `current_state` What's happening Relevant PR > `scheduled` `schedule_tis()` race : RUNNING → SCHEDULED PR [#60330](https://github.com/apache/airflow/pull/60330) > `failed` Cascade from the above PR [#60330](https://github.com/apache/airflow/pull/60330) > `skipped` `ti_skip_downstream()` race : RUNNING → SKIPPED (branching DAGs) PR [#63266](https://github.com/apache/airflow/pull/63266) > `not_found` K8s executor duplicate pod / stale UUID after scheduler restart No fix yet ([#57618](https://github.com/apache/airflow/issues/57618)) > **Our timeline** (2+ weeks of continuous monitoring) : > > Date 409s What changed > Feb 27-Mar 5 14-169 baseline, no fixes > Mar 6 22 `schedule_tis` patch deployed : `scheduled`+`failed` dropped to 0 > Mar 7-9 3-4 only `skipped` remaining > Mar 10 0 both patches active > **Mar 13** **0** **build-time patches on PROD : 0 race 409s, 0 orphaned tasks** > **Note if you're on 3.1.7 specifically :** on top of the race conditions above, 3.1.7 has a missing `task_reschedule` index + `DepContext` mutation leak ([#59604](https://github.com/apache/airflow/pull/59604)) that slows the scheduler, grows memory (ours peaked at 5.92/6 GiB), and widens the race window for both bugs. This created a feedback loop on our cluster where 409s escalated exponentially (5 → 26 → 133 → 374/day). Upgrading to 3.1.8 broke the loop ([#60931](https://github.com/apache/airflow/pull/60931), [#62089](https://github.com/apache/airflow/pull/62089)), but the race conditions still need the patches on top. > > **Related PRs :** > > What PR Status > `schedule_tis()` state guard [#60330](https://github.com/apache/airflow/pull/60330) ([@ephraimbuddy](https://github.com/ephraimbuddy)) Open, validated on our prod > `ti_skip_downstream()` state guard [#63266](https://github.com/apache/airflow/pull/63266) (ours) Open > Missing index + DepContext leak (3.1.7 only) [#60931](https://github.com/apache/airflow/pull/60931), [#62089](https://github.com/apache/airflow/pull/62089) Fixed in 3.1.8 > (AI-assisted investigation with Claude Code. All monitoring data from our prod cluster.) Nice investigation and explanation, we don’t have those problems as often anymore. Will try do do the same at our side and see if it also fixes the issue. We are running with 3 concurrent schedulers but since we now migrated from YugabyteDB to Postgres again that maybe also helped already. As YugabyteDB isn’t officially supported on Airflow we had to do some post alterations to make it run correctly on YugabyteDB as the Airflow migrations scripts aren’t optimized for it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
