sam-dumont commented on issue #59378: URL: https://github.com/apache/airflow/issues/59378#issuecomment-4055748325
We hit this on our prod cluster (830 DAGs, ~25k tasks/day, 2 schedulers, KubernetesExecutor, ~500 peak concurrent workers on EKS) and spent 2+ weeks instrumenting it. Sharing what we found in case it helps.

We traced it to at least two independent race conditions, both unguarded UPDATEs on task instance state that can overwrite a RUNNING task:

1. **`schedule_tis()` in `dagrun.py`**: a competing scheduler overwrites RUNNING → SCHEDULED. The worker heartbeats, gets a 409 with `current_state: scheduled`, and the task is killed. PR #60330 by @ephraimbuddy addresses this.
2. **`ti_skip_downstream()` in the Execution API**: `BranchOperator` marks unchosen tasks as SKIPPED, but the UPDATE has no state guard, so it can overwrite tasks that are already RUNNING. The worker heartbeats, gets a 409 with `current_state: skipped`, and the task is killed. We opened PR #63266 for this one.

Both seem to exist in every Airflow 3.x release with 2+ schedulers. The frequency likely scales with concurrency: more schedulers and more tasks mean a wider race window.

**What worked for us:**

We patched both code paths at Docker build time (patching the installed `.py` files so every process gets the fix). Our [`apply_patches.py`](https://gist.github.com/sam-dumont/4bdc214a7673f6571f4fa7058153a43c) script does this (it auto-detects already-patched files and self-disables once the upstream fix lands). This got us from 374 errors/day to 0.

One thing we learned the hard way: monkey patches in `airflow_local_settings.py` worked for `schedule_tis()` (which runs in the scheduler) but NOT for `ti_skip_downstream()`. That one runs on the API server (FastAPI/uvicorn), which has a different startup path and never imports `airflow_local_settings.py`. We had zero 409s for 18h, then `skipped` 409s exploded to 131/day when load increased. Build-time patching was the only approach that covered all processes.
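To make the "unguarded UPDATE" failure mode concrete, here is a minimal, self-contained sketch of the guard pattern the two PRs add. This uses a toy SQLAlchemy model, not Airflow's actual `TaskInstance` code paths; the model and column names are simplified assumptions for illustration only:

```python
# Toy illustration of the race-condition fix: a bulk UPDATE on task state
# with vs. without a state guard. Simplified model, NOT Airflow's real code.
from sqlalchemy import create_engine, Column, String, update
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class TaskInstance(Base):
    __tablename__ = "task_instance"
    task_id = Column(String, primary_key=True)
    state = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([
        TaskInstance(task_id="already_running", state="running"),
        TaskInstance(task_id="not_started", state="queued"),
    ])
    session.commit()

    # The bug is an UPDATE like this, which stomps the RUNNING task too:
    #   session.execute(update(TaskInstance).values(state="skipped"))
    # The worker for "already_running" then heartbeats, sees
    # current_state: skipped, gets a 409, and is killed.

    # The fix is a state guard so RUNNING tasks are left alone:
    session.execute(
        update(TaskInstance)
        .where(TaskInstance.state != "running")
        .values(state="skipped")
    )
    session.commit()

    states = {ti.task_id: ti.state for ti in session.query(TaskInstance)}
    # "already_running" keeps state "running"; only "not_started" is skipped
    print(states)
```

The same pattern applies to both vectors: the transition to SCHEDULED or SKIPPED is only valid from non-RUNNING states, so the WHERE clause closes the race window at the database level rather than relying on scheduler-side checks.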
**Diagnosing which vector you're hitting:**

Parsing the `current_state` field from the 409 response body was the key to telling them apart:

| `current_state` | What's happening | Relevant PR |
|---|---|---|
| `scheduled` | `schedule_tis()` race: RUNNING → SCHEDULED | PR #60330 |
| `failed` | Cascade from the above | PR #60330 |
| `skipped` | `ti_skip_downstream()` race: RUNNING → SKIPPED (branching DAGs) | PR #63266 |
| `not_found` | K8s executor duplicate pod / stale UUID after scheduler restart | No fix yet (#57618) |

**Our timeline** (2+ weeks of continuous monitoring):

| Date | 409s | What changed |
|---|---|---|
| Feb 27-Mar 5 | 14-169 | baseline, no fixes |
| Mar 6 | 22 | `schedule_tis` patch deployed: `scheduled`+`failed` dropped to 0 |
| Mar 7-9 | 3-4 | only `skipped` remaining |
| Mar 10 | 0 | both patches active |
| **Mar 13** | **0** | **build-time patches on PROD: 0 race 409s, 0 orphaned tasks** |

**Note if you're on 3.1.7 specifically:** on top of the race conditions above, 3.1.7 has a missing `task_reschedule` index and a `DepContext` mutation leak (#59604) that slow the scheduler, grow its memory (ours peaked at 5.92/6 GiB), and widen the race window for both bugs. This created a feedback loop on our cluster where 409s escalated exponentially (5 → 26 → 133 → 374/day). Upgrading to 3.1.8 broke the loop (#60931, #62089), but the race conditions still need the patches on top.

**Related PRs:**

| What | PR | Status |
|---|---|---|
| `schedule_tis()` state guard | #60330 (@ephraimbuddy) | Open, validated on our prod |
| `ti_skip_downstream()` state guard | #63266 (ours) | Open |
| Missing index + DepContext leak (3.1.7 only) | #60931, #62089 | Fixed in 3.1.8 |

(AI-assisted investigation with Claude Code. All monitoring data is from our prod cluster.)
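If it helps anyone triage their own logs, the diagnosis table above reduces to a small lookup. This is a hypothetical helper, not part of Airflow; it only assumes the 409 body is JSON with a `current_state` field, which matches what we saw in our worker logs:

```python
# Hypothetical triage helper: map the `current_state` field of a 409
# response body to the likely race vector (per the table above).
import json

VECTORS = {
    "scheduled": "schedule_tis() race: RUNNING -> SCHEDULED (PR #60330)",
    "failed": "cascade from schedule_tis() race (PR #60330)",
    "skipped": "ti_skip_downstream() race: RUNNING -> SKIPPED (PR #63266)",
    "not_found": "K8s duplicate pod / stale UUID after scheduler restart (#57618)",
}

def classify_409(body: str) -> str:
    """Return the likely race vector for a 409 heartbeat response body."""
    try:
        state = json.loads(body).get("current_state", "")
    except json.JSONDecodeError:
        return "unparseable 409 body"
    return VECTORS.get(state, f"unrecognized current_state: {state!r}")

print(classify_409('{"current_state": "skipped"}'))
```

Running this over a day of API-server access logs is how we produced the per-vector counts in the timeline table.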
