hkc-8010 opened a new issue, #67238:
URL: https://github.com/apache/airflow/issues/67238
### Apache Airflow version
2.11.2
### If "Other Airflow 2 version" selected, which one?
N/A
### What happened?
We have multiple anonymized production incidents on Airflow 2.11.2 with
`CeleryExecutor` where retry attempts that definitely ran are missing from
`task_instance_history`.
The visible user symptom is that the Task logs / tries UI does not show all
real attempts. But the more important observation is that this is not just a UI
problem: one or more retry attempts are actually absent from
`task_instance_history` even though scheduler and worker logs prove those
attempts executed.
Observed pattern in affected runs:
- the surviving `task_instance` row reflects the current or final attempt
- one or more earlier retry attempts are missing from `task_instance_history`
- `/tries` therefore omits those attempts because it is built from
`task_instance_history` plus the current `task_instance`
- in at least one case, logs for a missing attempt were still retrievable
directly, which makes the UI inconsistency more confusing
### What you think should happen instead?
Every retry attempt that actually executes should be preserved in
`task_instance_history`.
If a task reaches `deferred`, `up_for_retry`, `failed`, or any terminal state
for a given try number, that try should still exist in
`task_instance_history` afterward, and `/tries` should list it.
### How to reproduce
I do not yet have a minimal standalone reproducer, but the pattern we see
repeatedly in production is:
1. Run Airflow 2.11.2 with `CeleryExecutor`
2. Use a task that can retry, including cases that may defer and then retry
again
3. Let the task execute multiple tries
4. Inspect scheduler logs for the task/run and confirm a given try number
was sent to the executor and finished
5. Query `task_instance_history` for that same task/run
Observed result in affected runs:
- `task_instance.try_number` advances normally
- some earlier try numbers below the current try are missing from
`task_instance_history`
- `/tries` omits those missing attempts
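
A minimal sketch of the check in step 5, for anyone who wants to look for the same gaps. It runs against an in-memory SQLite stand-in with a simplified schema, not the real metadata DB. The table and column names follow the ones referenced in this report; the helper `missing_try_numbers` and the example key values are hypothetical. Against PostgreSQL the placeholder style depends on the driver you use.

```python
import sqlite3


def missing_try_numbers(conn, dag_id, task_id, run_id, map_index=-1):
    """Return try numbers below the current try that have no history row."""
    key = (dag_id, task_id, run_id, map_index)
    where = "dag_id = ? AND task_id = ? AND run_id = ? AND map_index = ?"
    row = conn.execute(
        f"SELECT try_number FROM task_instance WHERE {where}", key
    ).fetchone()
    if row is None:
        return []
    current_try = row[0]
    recorded = {
        r[0]
        for r in conn.execute(
            f"SELECT try_number FROM task_instance_history WHERE {where}", key
        )
    }
    # Every try below the current one should have been archived.
    return [n for n in range(1, current_try) if n not in recorded]


# Stand-in database with a simplified schema, for illustration only.
conn = sqlite3.connect(":memory:")
for table in ("task_instance", "task_instance_history"):
    conn.execute(
        f"CREATE TABLE {table} (dag_id TEXT, task_id TEXT, run_id TEXT, "
        "map_index INTEGER, try_number INTEGER, state TEXT)"
    )
key = ("example_dag", "example_task", "manual__example", -1)
conn.execute("INSERT INTO task_instance VALUES (?, ?, ?, ?, 3, 'failed')", key)
conn.execute(
    "INSERT INTO task_instance_history VALUES (?, ?, ?, ?, 1, 'success')", key
)

print(missing_try_numbers(conn, *key))
```

With the stand-in rows above (current try 3, history containing only try 1), the helper reports try 2 as missing, which is the same shape as the `missing_try_numbers` output in the concrete examples below.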
### Operating System
Linux / Kubernetes
### Versions of Apache Airflow Providers
Not yet isolated to a provider-specific issue.
### Deployment
Other Kubernetes deployment
### Deployment details
These incidents were observed on Astro Hosted deployments running Runtime
13.6.0 (`Airflow 2.11.2+astro.2`) with:
- `CeleryExecutor`
- two scheduler replicas present
- PostgreSQL metadata DB
I am filing this upstream because the symptom is directly in core retry
history persistence (`task_instance_history`), not in a provider package.
### Anything else?
This does **not** look like a UI-only bug.
## Concrete example 1
Dag: `cs-forecast-cl-data-preprocessing-bk-eks-intg`
Task: `publish_data.pyspark`
Run: `manual__2026-04-20T13:05:04+00:00`
Direct DB query from the scheduler container after the incident:
```text
{'current_try_number': 3, 'current_state': 'failed', 'current_start_date':
'2026-04-22 13:00:30.668381+00:00', 'current_end_date': '2026-04-22
13:06:51.673156+00:00', 'current_hostname': '10.94.10.200',
'current_external_executor_id': 'a29904ce-e180-4b40-80d6-366e3a3b8cd2'}
history_rows=
{'try_number': 1, 'state': 'success', 'start_date': '2026-04-20
13:19:53.454647+00:00', 'end_date': '2026-04-20 13:25:16.544094+00:00',
'hostname': '10.94.23.102', 'external_executor_id':
'b7ba76f7-d337-4634-a5b6-989dd041eef1'}
{'history_try_numbers': [1], 'missing_try_numbers': [2]}
```
So the current row says `try_number=3`, but `task_instance_history` contains
only try `1`. Try `2` is missing.
Scheduler and worker logs prove try 2 actually ran:
```text
[2026-04-22T12:51:27.388+0000] {scheduler_job_runner.py:692} INFO - Sending
TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg',
task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00',
try_number=2, map_index=-1) to CeleryExecutor with priority 2 and queue default
[2026-04-22 12:51:27,841: INFO/ForkPoolWorker-3] Running <TaskInstance:
cs-forecast-cl-data-preprocessing-bk-eks-intg.publish_data.pyspark
manual__2026-04-20T13:05:04+00:00 [queued]> on host 10.94.10.200
[2026-04-22T12:51:46.816+0000] {scheduler_job_runner.py:813} INFO -
TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg,
task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00,
map_index=-1, run_start_date=2026-04-22 12:51:28.238431+00:00,
run_end_date=None, run_duration=323.089447, state=deferred,
executor=CeleryExecutor(parallelism=25), executor_state=success, try_number=2,
max_tries=2, job_id=584134, pool=default_pool, queue=default,
priority_weight=2, operator=PySparkOperator, queued_dttm=2026-04-22
12:51:27.386468+00:00, queued_by_job_id=583149, pid=3570
[2026-04-22T12:55:24.447+0000] {scheduler_job_runner.py:692} INFO - Sending
TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg',
task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00',
try_number=2, map_index=-1) to CeleryExecutor with priority 2 and queue default
[2026-04-22T12:55:30.928+0000] {scheduler_job_runner.py:813} INFO -
TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg,
task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00,
map_index=-1, run_start_date=2026-04-22 12:51:28.238431+00:00,
run_end_date=2026-04-22 12:55:29.120599+00:00, run_duration=240.882168,
state=up_for_retry, executor=CeleryExecutor(parallelism=25),
executor_state=success, try_number=2, max_tries=2, job_id=584141,
pool=default_pool, queue=default, priority_weight=2, operator=PySparkOperator,
queued_dttm=2026-04-22 12:55:24.445065+00:00, queued_by_job_id=584108, pid=3734
[2026-04-22T13:00:30.033+0000] {scheduler_job_runner.py:692} INFO - Sending
TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg',
task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00',
try_number=3, map_index=-1) to CeleryExecutor with priority 2 and queue default
[2026-04-22T13:06:55.545+0000] {scheduler_job_runner.py:813} INFO -
TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg,
task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00,
map_index=-1, run_start_date=2026-04-22 13:00:30.668381+00:00,
run_end_date=2026-04-22 13:06:51.673156+00:00, run_duration=381.004775,
state=failed, executor=CeleryExecutor(parallelism=25), executor_state=success,
try_number=3, max_tries=2, job_id=584157, pool=default_pool, queue=default,
priority_weight=2, operator=PySparkOperator, queued_dttm=2026-04-22
13:06:46.703023+00:00, queued_by_job_id=583149, pid=4154
```
That sequence shows try 2 absolutely existed and reached `deferred` and then
`up_for_retry`, but afterward there is still no `task_instance_history` row for
try 2.
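
To cross-check logs against the DB, the `TaskInstance Finished` records can be reduced to `(try_number, state)` pairs. The following is a rough sketch, assuming the log line format shown above; the helper name `finished_tries` is hypothetical, and the sample text is abbreviated from the excerpt (most fields dropped) rather than a verbatim copy.

```python
import re

# \bstate= skips "executor_state=" because "_" is a word character.
FINISHED = re.compile(
    r"TaskInstance Finished:.*?\bstate=(\w+),.*?\btry_number=(\d+)",
    re.DOTALL,
)


def finished_tries(log_text):
    """Return (try_number, state) pairs from 'TaskInstance Finished' records."""
    # Collapse hard-wrapped lines so each record can be matched as one string.
    flat = " ".join(log_text.split())
    return [(int(m.group(2)), m.group(1)) for m in FINISHED.finditer(flat)]


# Abbreviated from the scheduler log excerpt above.
sample = """
INFO - TaskInstance Finished: task_id=publish_data.pyspark,
state=deferred, executor_state=success, try_number=2,
max_tries=2
INFO - TaskInstance Finished: task_id=publish_data.pyspark,
state=up_for_retry, executor_state=success, try_number=2, max_tries=2
INFO - TaskInstance Finished: task_id=publish_data.pyspark,
state=failed, executor_state=success,
try_number=3, max_tries=2
"""

for try_number, state in finished_tries(sample):
    print(try_number, state)
```

Any try number that appears in this list but not in `task_instance_history` (and is not the current `task_instance` row) is a lost attempt.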
## Concrete example 2
Dag: `bd-sourcery-odm-snapshot-daily-bk-eks`
Task: `create_snapshot.bkng_data`
Run: `scheduled__2026-04-23T01:00:00+00:00`
Direct DB query from the scheduler container:
```text
{'current_try_number': 20, 'current_state': 'success', 'current_start_date':
'2026-04-24 12:31:41.179719+00:00', 'current_end_date': '2026-04-24
14:14:41.020332+00:00', 'current_hostname': '10.94.27.178',
'current_external_executor_id': '3b9419d0-e97f-489c-9cde-5ef072d99854'}
history_rows=
{'try_number': 3, 'state': 'failed', 'start_date': '2026-04-24
02:22:26.799412+00:00', 'end_date': '2026-04-24 02:46:03.074407+00:00',
'hostname': '10.94.30.159', 'external_executor_id':
'5122ae59-e21c-4e69-bc3b-29bc2f55944c'}
{'try_number': 6, 'state': 'failed', 'start_date': '2026-04-24
08:32:26.583491+00:00', 'end_date': '2026-04-24 08:38:02.100532+00:00',
'hostname': '10.94.23.156', 'external_executor_id':
'd965e200-196f-46f2-b6f5-40de834202df'}
{'try_number': 8, 'state': 'failed', 'start_date': '2026-04-24
09:08:22.046598+00:00', 'end_date': '2026-04-24 09:12:15.401738+00:00',
'hostname': '10.94.13.159', 'external_executor_id':
'83efe05b-119c-4083-bf5b-3fa49f8f9e94'}
{'try_number': 10, 'state': 'failed', 'start_date': '2026-04-24
09:18:25.321445+00:00', 'end_date': '2026-04-24 09:18:30.991704+00:00',
'hostname': '10.94.30.159', 'external_executor_id':
'cb110363-675e-471d-9c9d-06cdf88ff6b6'}
{'try_number': 11, 'state': 'failed', 'start_date': '2026-04-24
09:23:45.550726+00:00', 'end_date': '2026-04-24 09:23:59.497683+00:00',
'hostname': '10.94.30.159', 'external_executor_id':
'136b1a2c-8a1d-4947-ab78-300d0c5e911a'}
{'try_number': 14, 'state': 'failed', 'start_date': '2026-04-24
10:23:08.636760+00:00', 'end_date': '2026-04-24 10:31:06.771334+00:00',
'hostname': '10.94.17.127', 'external_executor_id':
'35776857-c608-4a13-95ba-9d900daeaa6f'}
{'try_number': 15, 'state': 'failed', 'start_date': '2026-04-24
10:54:08.495904+00:00', 'end_date': '2026-04-24 11:06:53.790567+00:00',
'hostname': '10.94.20.200', 'external_executor_id':
'd3ccbaa3-3504-4b00-b248-3b51e750e25e'}
{'try_number': 16, 'state': 'failed', 'start_date': '2026-04-24
11:06:56.858252+00:00', 'end_date': '2026-04-24 11:15:49.369237+00:00',
'hostname': '10.94.9.51', 'external_executor_id':
'c065e193-c0fd-4081-a45e-09ead9bd613b'}
{'try_number': 17, 'state': 'failed', 'start_date': '2026-04-24
11:15:57.006413+00:00', 'end_date': '2026-04-24 11:52:47.887098+00:00',
'hostname': '10.94.20.17', 'external_executor_id':
'c2c0fe70-2f9f-4901-af1b-b22fc603bb67'}
{'try_number': 18, 'state': 'failed', 'start_date': '2026-04-24
11:52:54.897007+00:00', 'end_date': '2026-04-24 12:16:18.761605+00:00',
'hostname': '10.94.22.79', 'external_executor_id':
'4641c634-b572-4cbc-84ce-9d8b983210c3'}
{'try_number': 19, 'state': 'failed', 'start_date': '2026-04-24
12:16:21.797645+00:00', 'end_date': '2026-04-24 12:31:33.573200+00:00',
'hostname': '10.94.27.178', 'external_executor_id':
'f2a1b243-562d-4334-ae68-dc4054b5e8c9'}
{'history_try_numbers': [3, 6, 8, 10, 11, 14, 15, 16, 17, 18, 19],
'missing_try_numbers': [1, 2, 4, 5, 7, 9, 12, 13]}
```
So this is not a one-off single-gap case. Here, a task that reached
`try_number=20` is missing many earlier attempts from `task_instance_history`.
## Related issue
This appears to belong to the same family of bugs as the following issue,
although the executor and the exact symptom differ:
- #65366
That open issue reports retry-history loss symptoms on Airflow 3.1.x under
`KubernetesExecutor`.
I have not directly reproduced this on a running Airflow 3.2 deployment yet,
so I do not want to overclaim version scope here. But the code path that
snapshots a try into `task_instance_history` still appears materially similar
in current 3.x code, so this may not be isolated to the 2.x line.