8silvergun opened a new issue, #66763:
URL: https://github.com/apache/airflow/issues/66763

   ### Under which category would you file this issue?
   
   Airflow Core
   
   ### Apache Airflow version
   
   3.2.1
   
   ### What happened and how to reproduce it?
   
   
   #### What happened
   
   On a production Airflow 3.2.1 + MySQL deployment with ~3.2M rows in 
`task_instance`, the scheduler stopped queueing tasks. `TaskInstance`s remained 
in `scheduled` and did not transition to `queued`. Worker pods that had already 
completed their work could not report their final state back through the 
execution API, because metadata DB lock waits piled up behind the scheduler's 
critical-section query.
   
   Worker-side symptom (Airflow 3 execution API client):
   
   ```
   ReadTimeout: timed out
   airflow.sdk.api.client.Client.request
   PATCH /execution/task-instances/<...>/state
   EOFError: Request socket closed before length
   BrokenPipeError: [Errno 32] Broken pipe
   ```
   
   Scheduler-side symptom:
   
   ```
   sqlalchemy.exc.OperationalError: Query execution was interrupted
   SELECT task_instance ... LIMIT 16 FOR UPDATE OF task_instance SKIP LOCKED
   ```
   
   #### Root cause
   
   The slow query is generated by
   
`airflow.jobs.scheduler_job_runner.SchedulerJobRunner._executable_task_instances_to_queued`.
   
   PR #54103 (shipped in 3.2.0) reshaped this method into a two-step pattern: a 
`ranked_query` that picks candidate `SCHEDULED` task instances, followed by an 
**outer query** that rejoins the candidates back to `task_instance` by the 
composite identity `(dag_id, task_id, run_id, map_index)`.
   
   Both the inner ranked query **and** the outer rejoin currently apply the 
same MySQL index hint:
   
   ```python
   .with_hint(TI, "USE INDEX (ti_state)", dialect_name="mysql")
   ```
   
   The hint is appropriate on the inner query — that query filters by `TI.state 
== SCHEDULED` and the `ti_state(state)` index is the right one.
   
   The hint is **harmful on the outer query**. The outer query is not a state 
lookup; it is an exact lookup keyed by `(dag_id, task_id, run_id, map_index)`. 
For that pattern, MySQL should use the `task_instance_composite_key(dag_id, 
task_id, run_id, map_index)` index. But the hint restricts the optimizer to 
`ti_state`, so the composite key is unreachable and the optimizer falls back to 
a full scan of `task_instance`.
   
   #### Source check (current `main`, file 
`airflow-core/src/airflow/jobs/scheduler_job_runner.py`)
   
   The function `_executable_task_instances_to_queued` contains **two** 
occurrences of `.with_hint(TI, "USE INDEX (ti_state)", dialect_name="mysql")`:
   
   1. On the inner candidate query that filters `DR.state == RUNNING` and 
`TI.state == SCHEDULED` — appropriate.
   2. On the outer query: 
`select(TI).with_hint(...).select_from(ranked_query).join(TI, (TI.dag_id == 
ranked_query.c.dag_id) & ...)` — **misplaced**.
   
   #### Evidence (production EXPLAIN on Aurora MySQL 8.0)
   
   Metadata table state at the time of incident (2026-05-11 KST):
   
   ```
   task_instance rows: ~3,221,348
   SCHEDULED task_instance: ~187
   RUNNING/QUEUED task_instance: 0
   RUNNING dag_run: ~141
   ```
   
   Relevant existing indexes on `task_instance`:
   
   ```
   task_instance_composite_key(dag_id, task_id, run_id, map_index)
   ti_state(state)
   ti_dag_run(dag_id, run_id)
   ti_state_lkp(dag_id, task_id, run_id, state)
   ```
   
   EXPLAIN of the unmodified scheduler query (outer hint present):
   
   ```
   PRIMARY task_instance ALL rows=3,221,348
   Using where; Using join buffer
   ```
   
   EXPLAIN after removing only the **outer** `USE INDEX (ti_state)` hint:
   
   ```
   PRIMARY task_instance eq_ref
   key = task_instance_composite_key
   rows = 1
   Using index
   ```
   
   Inner ranked query in both cases (kept as-is):
   
   ```
   DERIVED task_instance ref
   key = ti_state
   rows = 187
   ```
   
   So the same scheduler loop reads either ~3.2M rows or 1 row per candidate, 
depending solely on the outer hint.
   
   #### Minimal reproduction
   
   1. Stand up Airflow 3.2.0 or 3.2.1 against MySQL 8.0.
   2. Populate `task_instance` so it has on the order of millions of total rows 
but only a small handful in `SCHEDULED` (this matches a long-lived production 
deployment).
   3. Run the scheduler under load.
   4. `EXPLAIN` the query emitted from `_executable_task_instances_to_queued`; 
the outer join to `task_instance` will be `type=ALL`.
   5. Remove only the second `.with_hint(TI, "USE INDEX (ti_state)", 
dialect_name="mysql")` (the one immediately before 
`.select_from(ranked_query)`). Re-run and `EXPLAIN` — the outer join becomes 
`eq_ref` on `task_instance_composite_key`.
   
   ### What you think should happen instead?
   
   
   The outer query (the rejoin to `task_instance` keyed by the composite 
identity) should use `task_instance_composite_key`, not `ti_state`. The fix is 
to remove the second `.with_hint(TI, "USE INDEX (ti_state)", 
dialect_name="mysql")` only. The inner hint should stay.
   
   After the fix, on the same workload:
   
   ```
   outer task_instance: task_instance_composite_key, eq_ref, rows=1
   inner scheduled-TI scan: ti_state, ref/range over scheduled states
   ```
   
   
   
   ### Operating System
   
   Linux (kernel 6.x), Amazon EKS nodes on Bottlerocket / AL2023.
   
   ### Deployment
   
   Other 3rd-party Helm chart
   
   ### Apache Airflow Provider(s)
   
   _No response_
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Official Helm Chart version
   
   1.20.0
   
   ### Kubernetes Version
   
   1.33
   
   ### Helm Chart configuration
   
   _No response_
   
   ### Docker Image customizations
   
   _No response_
   
   ### Anything else?
   
   
   - The original `USE INDEX (ti_state)` hint was added in apache/airflow#25627 
(≈2022, Airflow 2.2.5) to fix a *separate* MySQL optimizer issue where the 
optimizer ignored `ti_state` on a single-step query. At the time, the hint 
placement was correct.
   - PR #54103 (merged 2026-01-28, ships in 3.2.0) refactored 
`_executable_task_instances_to_queued` into the ranked + outer-rejoin shape and 
replicated the hint onto the outer rejoin. **This is the regression 
introduction point.**
   - The bug is dialect-scoped (`dialect_name="mysql"`), so PostgreSQL 
deployments are unaffected. This may be why the regression has gone unreported 
by managed-service users (MWAA/Composer typically default to Postgres).
   - The bug surfaces only when `task_instance` is large enough that a full 
scan is materially slower than an `eq_ref` lookup. Small test environments will 
not reproduce visible symptoms even though the EXPLAIN is wrong.
   - Internal mitigation already deployed: a sitecustomize-based hotpatch that 
removes only the second hint at process start. Commit `9482f6f8 fix: hotpatch 
airflow scheduler ti state hint` in our internal repo. No other behavior 
changed.
   - Possibly related but **not the same** issue: apache/airflow#57210 
(deferred-task priority in the scheduler selection query) — references the same 
code path but a different symptom.
   
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to