sw-cyderes opened a new issue, #67287:
URL: https://github.com/apache/airflow/issues/67287
### Under which category would you file this issue?
Airflow Core
### Apache Airflow version
3.2.1
### What happened and how to reproduce it?
Same race condition as #66374, but the post-defer TI lands in queued state
rather than scheduled, so the fix in #66431 / backport #67089 (merged for
3.2.2) may not cover it. The scheduler still treats the stale executor success
event from the worker's defer-exit as a state mismatch and kills the task
externally.
Timeline from one occurrence:
[2026-05-20 13:37:14] worker     TI started (try_number=1)
[2026-05-20 13:37:18] worker     Pausing task as DEFERRED (defer() called, worker exits cleanly)
[2026-05-20 13:37:45] scheduler  Executor LocalExecutor reported that the task instance <TaskInstance: .... [queued]> finished with state success, but the task instance's state attribute is queued. Task marked as up_for_retry
[2026-05-20 13:37:??] worker     try_number=2 starts, success
Notice the TI state attribute is queued (and the TI bracket label is also
[queued]), not scheduled. The fix added by #66431 in process_executor_events
only marks the event as ti_requeued when state == SCHEDULED and next_method is
not None. The same logical condition (resume-after-defer with a stale executor
success) can also leave the TI in QUEUED under load. That path falls through
the new branch, and the task is still failed/retried.
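The gap can be illustrated with a small standalone model. This is only a sketch of the condition as described in this report, not the actual scheduler code: `TIState`, `FakeTI`, and `is_stale_defer_exit_current` are hypothetical names, and nothing here imports Airflow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TIState(str, Enum):
    # Hypothetical stand-in for Airflow's TaskInstanceState (subset only).
    SCHEDULED = "scheduled"
    QUEUED = "queued"


@dataclass
class FakeTI:
    # Hypothetical stand-in for a TaskInstance row after the trigger fired.
    state: TIState
    next_method: Optional[str]


def is_stale_defer_exit_current(ti: FakeTI) -> bool:
    """Condition as this report describes the #66431 fix: SCHEDULED + next_method set."""
    return ti.state == TIState.SCHEDULED and ti.next_method is not None


scheduled_ti = FakeTI(TIState.SCHEDULED, "execute_complete")
queued_ti = FakeTI(TIState.QUEUED, "execute_complete")

# The scheduled-state variant is recognised as a requeue ...
print(is_stale_defer_exit_current(scheduled_ti))
# ... while the queued-state variant falls through to the mismatch handling.
print(is_stale_defer_exit_current(queued_ti))
```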
We see this race fire on a subset of deferred tasks per DAG run. It does not
reproduce deterministically: with the same code and the same workload, it
sometimes races and sometimes does not.
Differences and linkages with current tickets
- #66374 (CLOSED): Same race, scheduled-state variant. Fixed by #66431
(merged to main 2026-05-18, milestone 3.2.2) and backported via #67089 to
v3-2-test. The fix added the ti_requeued branch for state == SCHEDULED with
next_method set. This report is the analogous case for state == QUEUED, which
the existing branch does not cover.
- #66431 / #67089: The PRs that fix #66374. After applying these, the
scheduled-state mismatch goes away, but in our environment the queued-state
mismatch continues to fire on the same operator (RunMergerEc2Operator) on the
same defer path. The fix should likely be extended to also treat state ==
QUEUED with next_method is not None as a stale defer-exit success, since the
trigger may have rescheduled the TI into either QUEUED or SCHEDULED depending
on which scheduler loop iteration processes the trigger event first.
- #53797 (CLOSED, inconclusive): Earlier report of the [queued]-state
mismatch on 3.0.3 with LocalExecutor. Closed because the original reporter
couldn't repro on their helm deployment ("edge case for docker macOS"), but the
comment thread has follow-up reports on Linux / GKE / 3.0.4. This appears to be
the same family of race, still present in 3.2.x, and worth treating as a
distinct, ongoing bug rather than a Docker-for-Mac quirk.
- #23824 / #23846: The original 2.x-era fix for the queued-state mismatch.
The 3.x task-SDK / API-server architecture appears to have re-opened this state
pair on the defer path.
Reproducer (probabilistic)
1. Airflow 3.2.x, LocalExecutor
2. A deferrable operator that calls self.defer(trigger=...,
method_name="execute_complete").
3. Observe the audit/event log. A subset of TIs may hit the [queued]
mismatch instead of the scheduled one.
### What you think should happen instead?
The ti_requeued branch in process_executor_events should also handle the
queued-state variant, e.g.:
    if (
        state == TaskInstanceState.SUCCESS
        and ti.next_method is not None
        and ti.state in (TaskInstanceState.SCHEDULED, TaskInstanceState.QUEUED)
    ):
        # stale defer-exit success, treat as requeue
        ...
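To make the intended behaviour concrete, the proposed condition can be exercised outside Airflow with a few table-driven cases. This is a hedged sketch under stated assumptions: `State` is a hypothetical enum standing in for `TaskInstanceState`, `should_treat_as_requeue` is an illustrative helper name, and the real `process_executor_events` may structure the check differently.

```python
from enum import Enum
from typing import Optional


class State(str, Enum):
    # Hypothetical stand-in for TaskInstanceState (subset only).
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


def should_treat_as_requeue(
    executor_state: State, ti_state: State, next_method: Optional[str]
) -> bool:
    """Proposed predicate: stale defer-exit success for SCHEDULED or QUEUED TIs."""
    return (
        executor_state == State.SUCCESS
        and next_method is not None
        and ti_state in (State.SCHEDULED, State.QUEUED)
    )


# (executor-reported state, TI state, next_method) -> expected result
cases = [
    ((State.SUCCESS, State.SCHEDULED, "execute_complete"), True),   # covered today
    ((State.SUCCESS, State.QUEUED, "execute_complete"), True),      # this report
    ((State.SUCCESS, State.QUEUED, None), False),                   # not a resume
    ((State.FAILED, State.QUEUED, "execute_complete"), False),      # real failure
    ((State.SUCCESS, State.RUNNING, "execute_complete"), False),    # out of scope
]

results = [should_treat_as_requeue(*args) == expected for args, expected in cases]
print(all(results))
```

Requiring `next_method` to be set keeps a genuine queued/success mismatch on a non-deferred task on the existing failure path.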
### Operating System
Airflow runs in the official apache/airflow:3.2.1 Docker image on Linux.
Host: Ubuntu / Linux.
### Deployment
Docker-Compose
### Apache Airflow Provider(s)
_No response_
### Versions of Apache Airflow Providers
_No response_
### Official Helm Chart version
Not Applicable
### Kubernetes Version
_No response_
### Helm Chart configuration
_No response_
### Docker Image customizations
_No response_
### Anything else?
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)