dabla commented on issue #59378:
URL: https://github.com/apache/airflow/issues/59378#issuecomment-4056773277

   > We hit this on our prod cluster (830 DAGs, ~25k tasks/day, 2 schedulers, 
KubernetesExecutor, ~500 peak concurrent workers on EKS) and spent 2+ weeks 
instrumenting it. Sharing what we found in case it helps.
   > 
   > We traced it to at least two independent race conditions, both unguarded UPDATEs on task instance state that can overwrite a RUNNING task:
   > 
   > 1. **`schedule_tis()` in `dagrun.py`**: a competing scheduler overwrites RUNNING → SCHEDULED. The worker heartbeats, gets a 409 with `current_state: scheduled`, and the task is killed. PR [Fix HA scheduler try_number double increment #60330](https://github.com/apache/airflow/pull/60330) by [@ephraimbuddy](https://github.com/ephraimbuddy) addresses this.
   > 2. **`ti_skip_downstream()` in the Execution API**: `BranchOperator` marks unchosen tasks as SKIPPED, but the UPDATE has no state guard, so it can overwrite tasks that are already RUNNING. The worker heartbeats, gets a 409 with `current_state: skipped`, and the task is killed. We opened PR [Fix ti_skip_downstream overwriting RUNNING tasks to SKIPPED #63266](https://github.com/apache/airflow/pull/63266) for this one.
   > 
   > Both seem to exist in every Airflow 3.x release with 2+ schedulers. The frequency likely scales with concurrency: more schedulers and more tasks mean a wider race window.
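The difference between an unguarded and a guarded state UPDATE can be sketched with plain SQL (a minimal illustration; the table and state names mimic Airflow's, but this is not its actual schema or code):

```python
import sqlite3

# Minimal sketch of the race: an unguarded UPDATE blindly overwrites a
# RUNNING task, while a state guard in the WHERE clause leaves it alone.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO task_instance VALUES ('branch_left', 'running')")

# The bug: no state guard, so the RUNNING task is stomped to SKIPPED.
conn.execute("UPDATE task_instance SET state = 'skipped' WHERE task_id = ?",
             ("branch_left",))
print(conn.execute("SELECT state FROM task_instance").fetchone()[0])  # skipped

# Reset, then the fix: only transition from states where skipping is
# still legal, so a task that already started running is untouched.
conn.execute("UPDATE task_instance SET state = 'running' WHERE task_id = ?",
             ("branch_left",))
cur = conn.execute(
    "UPDATE task_instance SET state = 'skipped' "
    "WHERE task_id = ? AND state IN ('none', 'scheduled', 'queued')",
    ("branch_left",))
print(cur.rowcount)  # 0 -- no rows matched; the RUNNING task survives
print(conn.execute("SELECT state FROM task_instance").fetchone()[0])  # running
```

The guarded form is also what both linked PRs do conceptually: make the transition conditional on the current state instead of assuming it.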
   > 
   > **What worked for us:**
   > 
   > We patched both code paths at Docker build time (patching the installed `.py` files so every process gets the fix). Our [`apply_patches.py`](https://gist.github.com/sam-dumont/4bdc214a7673f6571f4fa7058153a43c) script does this (it auto-detects already-patched files and self-disables once the upstream fix lands). This got us from 374 errors/day to 0.
   > 
   > One thing we learned the hard way: monkey patches in `airflow_local_settings.py` worked for `schedule_tis()` (which runs in the scheduler) but NOT for `ti_skip_downstream()`. That one runs on the API server (FastAPI/uvicorn), which has a different startup path and never imports `airflow_local_settings.py`. We had zero 409s for 18 hours, then `skipped` 409s exploded to 131/day when load increased. Build-time patching was the only approach that covered all processes.
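The "auto-detects already-patched files" part of that approach can be sketched like this (the module name and marker string below are illustrative assumptions, not the gist's actual code):

```python
import importlib.util
from pathlib import Path

# Hypothetical guard string the patch would add; pick something unique
# to the fix so an already-patched file is recognized and skipped.
MARKER = "state not in (TaskInstanceState.RUNNING,)"


def needs_patch(module_name: str, marker: str = MARKER) -> bool:
    """Return True if the installed source file for `module_name`
    exists on disk and does not yet contain the patch marker."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin in (None, "built-in", "frozen"):
        return False  # not installed, or no source file to patch
    return marker not in Path(spec.origin).read_text()
```

Running a check like this at image build time, against the concrete file paths reported by `find_spec`, is what guarantees every process (scheduler, API server, workers) sees the same patched source.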
   > 
   > **Diagnosing which vector you're hitting:**
   > 
   > Parsing the `current_state` field from the 409 response body was the key to telling them apart:
   > 
   > | `current_state` | What's happening | Relevant PR |
   > | --- | --- | --- |
   > | `scheduled` | `schedule_tis()` race: RUNNING → SCHEDULED | [#60330](https://github.com/apache/airflow/pull/60330) |
   > | `failed` | Cascade from the above | [#60330](https://github.com/apache/airflow/pull/60330) |
   > | `skipped` | `ti_skip_downstream()` race: RUNNING → SKIPPED (branching DAGs) | [#63266](https://github.com/apache/airflow/pull/63266) |
   > | `not_found` | K8s executor duplicate pod / stale UUID after scheduler restart | No fix yet ([#57618](https://github.com/apache/airflow/issues/57618)) |
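A minimal sketch of that triage, assuming the 409 body carries `current_state` under a `detail` key (the exact response shape may differ between Airflow versions, so adapt the parsing to what your worker logs actually show):

```python
import json

# Map of current_state values to the suspected race vector, taken from
# the table above.
VECTORS = {
    "scheduled": "schedule_tis() race (PR #60330)",
    "failed": "cascade from schedule_tis() race (PR #60330)",
    "skipped": "ti_skip_downstream() race (PR #63266)",
    "not_found": "K8s duplicate pod / stale UUID (issue #57618)",
}


def classify_409(body: str) -> str:
    """Classify a 409 response body by its current_state field."""
    try:
        detail = json.loads(body).get("detail", {})
    except json.JSONDecodeError:
        return "unparseable body"
    state = detail.get("current_state") if isinstance(detail, dict) else None
    return VECTORS.get(state, f"unknown state: {state}")


print(classify_409('{"detail": {"current_state": "skipped"}}'))
# ti_skip_downstream() race (PR #63266)
```

Counting these per day, per vector, is how we produced the timeline below and confirmed each patch eliminated its own class of 409s.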
   > 
   > **Our timeline** (2+ weeks of continuous monitoring):
   > 
   > | Date | 409s | What changed |
   > | --- | --- | --- |
   > | Feb 27-Mar 5 | 14-169 | baseline, no fixes |
   > | Mar 6 | 22 | `schedule_tis` patch deployed: `scheduled`+`failed` dropped to 0 |
   > | Mar 7-9 | 3-4 | only `skipped` remaining |
   > | Mar 10 | 0 | both patches active |
   > | **Mar 13** | **0** | **build-time patches on PROD: 0 race 409s, 0 orphaned tasks** |
   > 
   > **Note if you're on 3.1.7 specifically:** on top of the race conditions above, 3.1.7 has a missing `task_reschedule` index and a `DepContext` mutation leak ([#59604](https://github.com/apache/airflow/pull/59604)) that slow the scheduler, grow memory (ours peaked at 5.92/6 GiB), and widen the race window for both bugs. This created a feedback loop on our cluster where 409s escalated exponentially (5 → 26 → 133 → 374/day). Upgrading to 3.1.8 broke the loop ([#60931](https://github.com/apache/airflow/pull/60931), [#62089](https://github.com/apache/airflow/pull/62089)), but the race conditions still need the patches on top.
   > 
   > **Related PRs:**
   > 
   > | What | PR | Status |
   > | --- | --- | --- |
   > | `schedule_tis()` state guard | [#60330](https://github.com/apache/airflow/pull/60330) ([@ephraimbuddy](https://github.com/ephraimbuddy)) | Open, validated on our prod |
   > | `ti_skip_downstream()` state guard | [#63266](https://github.com/apache/airflow/pull/63266) (ours) | Open |
   > | Missing index + `DepContext` leak (3.1.7 only) | [#60931](https://github.com/apache/airflow/pull/60931), [#62089](https://github.com/apache/airflow/pull/62089) | Fixed in 3.1.8 |
   > (AI-assisted investigation with Claude Code. All monitoring data from our 
prod cluster.)
   
   Nice investigation and explanation; we don't have those problems as often anymore. We will try to do the same on our side and see if it also fixes the issue.
   
   We are running with 3 concurrent schedulers, but since we have now migrated from YugabyteDB back to Postgres, that may also have helped already. As YugabyteDB isn't officially supported by Airflow, we had to make some post-migration alterations to get it running correctly, since the Airflow migration scripts aren't optimized for it.

