hkc-8010 opened a new issue, #67238:
URL: https://github.com/apache/airflow/issues/67238

   ### Apache Airflow version
   
   2.11.2
   
   ### If "Other Airflow 2 version" selected, which one?
   
   N/A
   
   ### What happened?
   
   We have multiple anonymized production incidents on Airflow 2.11.2 with 
`CeleryExecutor` where retry attempts that definitely ran are missing from 
`task_instance_history`.
   
   The visible user symptom is that the Task logs / tries UI does not show all 
real attempts. But the more important observation is that this is not just a UI 
problem: one or more retry attempts are actually absent from 
`task_instance_history` even though scheduler and worker logs prove those 
attempts executed.
   
   Observed pattern in affected runs:
   
   - the surviving `task_instance` row reflects the current or final attempt
   - one or more earlier retry attempts are missing from `task_instance_history`
   - `/tries` therefore omits those attempts because it is built from 
`task_instance_history` plus the current `task_instance`
   - in at least one case, logs for a missing attempt were still retrievable 
directly, which makes the UI inconsistency more confusing
   
   ### What you think should happen instead?
   
   Every retry attempt that actually executes should be preserved in 
`task_instance_history`.
   
   If a task reaches `deferred`, `up_for_retry`, `failed`, or any other state 
that ends or suspends a given try, that try should still exist in 
`task_instance_history` afterward, and `/tries` should list it.
   
   ### How to reproduce
   
   I do not yet have a minimal standalone reproducer, but the pattern seen 
repeatedly in production is:
   
   1. Run Airflow 2.11.2 with `CeleryExecutor`
   2. Use a task that can retry, including cases that may defer and then retry 
again
   3. Let the task execute multiple tries
   4. Inspect scheduler logs for the task/run and confirm a given try number 
was sent to the executor and finished
   5. Query `task_instance_history` for that same task/run
   
   Observed result in affected runs:
   
   - `task_instance.try_number` advances normally
   - some earlier try numbers below the current try are missing from 
`task_instance_history`
   - `/tries` omits those missing attempts
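   
   As an illustration of the check in step 5, here is a minimal sketch of how 
the missing try numbers can be computed once the current `try_number` and the 
`try_number` values from `task_instance_history` are known. The helper name 
`missing_try_numbers` is hypothetical, and the sketch assumes try numbers start 
at 1 and that every try below the current one should have a history row.

```python
def missing_try_numbers(current_try_number, history_try_numbers):
    """Return tries below the current one that have no history row.

    Assumption: tries are numbered from 1 and the current try lives only in
    task_instance, so only tries 1..current-1 are expected in history.
    """
    recorded = set(history_try_numbers)
    return [n for n in range(1, current_try_number) if n not in recorded]


# Values taken from the two incidents reported in this issue.
example_1 = missing_try_numbers(3, [1])
example_2 = missing_try_numbers(20, [3, 6, 8, 10, 11, 14, 15, 16, 17, 18, 19])
print(example_1)
print(example_2)
```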
   
   ### Operating System
   
   Linux / Kubernetes
   
   ### Versions of Apache Airflow Providers
   
   Not yet isolated to a provider-specific issue.
   
   ### Deployment
   
   Other Kubernetes deployment
   
   ### Deployment details
   
   These incidents were observed on Astro Hosted deployments running Runtime 
13.6.0 (`Airflow 2.11.2+astro.2`) with:
   
   - `CeleryExecutor`
   - two scheduler replicas present
   - PostgreSQL metadata DB
   
   I am filing this upstream because the symptom is directly in core retry 
history persistence (`task_instance_history`), not in a provider package.
   
   ### Anything else?
   
   This does **not** look like a UI-only bug.
   
   ## Concrete example 1
   
   Dag: `cs-forecast-cl-data-preprocessing-bk-eks-intg`  
   Task: `publish_data.pyspark`  
   Run: `manual__2026-04-20T13:05:04+00:00`
   
   Direct DB query from the scheduler container after the incident:
   
   ```text
   {'current_try_number': 3, 'current_state': 'failed', 'current_start_date': 
'2026-04-22 13:00:30.668381+00:00', 'current_end_date': '2026-04-22 
13:06:51.673156+00:00', 'current_hostname': '10.94.10.200', 
'current_external_executor_id': 'a29904ce-e180-4b40-80d6-366e3a3b8cd2'}
   history_rows=
   {'try_number': 1, 'state': 'success', 'start_date': '2026-04-20 
13:19:53.454647+00:00', 'end_date': '2026-04-20 13:25:16.544094+00:00', 
'hostname': '10.94.23.102', 'external_executor_id': 
'b7ba76f7-d337-4634-a5b6-989dd041eef1'}
   {'history_try_numbers': [1], 'missing_try_numbers': [2]}
   ```
   
   So the current row says `try_number=3`, but `task_instance_history` contains 
only try `1`. Try `2` is missing.
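   
   For anyone who wants to run a comparable check, the following is a rough 
sketch of such a query. It is not the exact query used above. To stay 
self-contained it runs against an in-memory SQLite stand-in rather than the real 
PostgreSQL metadata DB, and the two tables are reduced to the columns the check 
needs (`dag_id`, `task_id`, `run_id`, `map_index`, `try_number`, `state`); the 
real tables have many more columns.

```python
import sqlite3

# In-memory stand-in for the metadata DB (assumption: reduced column set).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE task_instance (
        dag_id TEXT, task_id TEXT, run_id TEXT, map_index INTEGER,
        try_number INTEGER, state TEXT
    );
    CREATE TABLE task_instance_history (
        dag_id TEXT, task_id TEXT, run_id TEXT, map_index INTEGER,
        try_number INTEGER, state TEXT
    );
    """
)

key = (
    "cs-forecast-cl-data-preprocessing-bk-eks-intg",
    "publish_data.pyspark",
    "manual__2026-04-20T13:05:04+00:00",
    -1,
)
# Rows mirroring the state reported in example 1.
conn.execute("INSERT INTO task_instance VALUES (?, ?, ?, ?, ?, ?)", key + (3, "failed"))
conn.execute(
    "INSERT INTO task_instance_history VALUES (?, ?, ?, ?, ?, ?)", key + (1, "success")
)

WHERE = "dag_id = ? AND task_id = ? AND run_id = ? AND map_index = ?"
current_try = conn.execute(
    f"SELECT try_number FROM task_instance WHERE {WHERE}", key
).fetchone()[0]
history_tries = [
    row[0]
    for row in conn.execute(
        f"SELECT try_number FROM task_instance_history WHERE {WHERE} ORDER BY try_number",
        key,
    )
]
# Every try below the current one is expected to have a history row.
missing = sorted(set(range(1, current_try)) - set(history_tries))
print({"history_try_numbers": history_tries, "missing_try_numbers": missing})
```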
   
   Scheduler and worker logs prove try 2 actually ran:
   
   ```text
   [2026-04-22T12:51:27.388+0000] {scheduler_job_runner.py:692} INFO - Sending 
TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg', 
task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00', 
try_number=2, map_index=-1) to CeleryExecutor with priority 2 and queue default
   
   [2026-04-22 12:51:27,841: INFO/ForkPoolWorker-3] Running <TaskInstance: 
cs-forecast-cl-data-preprocessing-bk-eks-intg.publish_data.pyspark 
manual__2026-04-20T13:05:04+00:00 [queued]> on host 10.94.10.200
   
   [2026-04-22T12:51:46.816+0000] {scheduler_job_runner.py:813} INFO - 
TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg, 
task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00, 
map_index=-1, run_start_date=2026-04-22 12:51:28.238431+00:00, 
run_end_date=None, run_duration=323.089447, state=deferred, 
executor=CeleryExecutor(parallelism=25), executor_state=success, try_number=2, 
max_tries=2, job_id=584134, pool=default_pool, queue=default, 
priority_weight=2, operator=PySparkOperator, queued_dttm=2026-04-22 
12:51:27.386468+00:00, queued_by_job_id=583149, pid=3570
   
   [2026-04-22T12:55:24.447+0000] {scheduler_job_runner.py:692} INFO - Sending 
TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg', 
task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00', 
try_number=2, map_index=-1) to CeleryExecutor with priority 2 and queue default
   
   [2026-04-22T12:55:30.928+0000] {scheduler_job_runner.py:813} INFO - 
TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg, 
task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00, 
map_index=-1, run_start_date=2026-04-22 12:51:28.238431+00:00, 
run_end_date=2026-04-22 12:55:29.120599+00:00, run_duration=240.882168, 
state=up_for_retry, executor=CeleryExecutor(parallelism=25), 
executor_state=success, try_number=2, max_tries=2, job_id=584141, 
pool=default_pool, queue=default, priority_weight=2, operator=PySparkOperator, 
queued_dttm=2026-04-22 12:55:24.445065+00:00, queued_by_job_id=584108, pid=3734
   
   [2026-04-22T13:00:30.033+0000] {scheduler_job_runner.py:692} INFO - Sending 
TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg', 
task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00', 
try_number=3, map_index=-1) to CeleryExecutor with priority 2 and queue default
   
   [2026-04-22T13:06:55.545+0000] {scheduler_job_runner.py:813} INFO - 
TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg, 
task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00, 
map_index=-1, run_start_date=2026-04-22 13:00:30.668381+00:00, 
run_end_date=2026-04-22 13:06:51.673156+00:00, run_duration=381.004775, 
state=failed, executor=CeleryExecutor(parallelism=25), executor_state=success, 
try_number=3, max_tries=2, job_id=584157, pool=default_pool, queue=default, 
priority_weight=2, operator=PySparkOperator, queued_dttm=2026-04-22 
13:06:46.703023+00:00, queued_by_job_id=583149, pid=4154
   ```
   
   That sequence shows try 2 ran, reached `deferred`, was sent to the executor 
again, and then moved to `up_for_retry`, yet afterward there is still no 
`task_instance_history` row for try 2.
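   
   The log cross-check can also be scripted. The sketch below extracts try 
numbers from scheduler log lines of the two kinds shown above ("Sending 
TaskInstanceKey(...)" and "TaskInstance Finished: ...") and compares them with 
the history rows. The regular expressions and the function name 
`tries_seen_in_logs` are assumptions based only on the log format in this 
excerpt, and the sample lines are trimmed to the fields the patterns need.

```python
import re

# Patterns inferred from the excerpt above (assumption: format is stable).
SENT = re.compile(r"Sending TaskInstanceKey\(.*?try_number=(\d+)")
FINISHED = re.compile(r"TaskInstance Finished:.*?\bstate=(\w+),.*?try_number=(\d+)")


def tries_seen_in_logs(lines):
    """Return (tries sent to the executor, {try: [states reported as finished]})."""
    sent, finished = set(), {}
    for line in lines:
        match = SENT.search(line)
        if match:
            sent.add(int(match.group(1)))
            continue
        match = FINISHED.search(line)
        if match:
            finished.setdefault(int(match.group(2)), []).append(match.group(1))
    return sent, finished


# Trimmed versions of the scheduler lines from example 1.
log_lines = [
    "INFO - Sending TaskInstanceKey(task_id='publish_data.pyspark', try_number=2, map_index=-1) to CeleryExecutor",
    "INFO - TaskInstance Finished: task_id=publish_data.pyspark, state=deferred, executor_state=success, try_number=2, max_tries=2",
    "INFO - Sending TaskInstanceKey(task_id='publish_data.pyspark', try_number=2, map_index=-1) to CeleryExecutor",
    "INFO - TaskInstance Finished: task_id=publish_data.pyspark, state=up_for_retry, executor_state=success, try_number=2, max_tries=2",
    "INFO - Sending TaskInstanceKey(task_id='publish_data.pyspark', try_number=3, map_index=-1) to CeleryExecutor",
    "INFO - TaskInstance Finished: task_id=publish_data.pyspark, state=failed, executor_state=success, try_number=3, max_tries=2",
]

sent, finished = tries_seen_in_logs(log_lines)
history_try_numbers = [1]  # from the DB query in example 1
current_try_number = 3

# Tries that the logs prove were dispatched but that have no history row.
unrecorded = sorted(
    t for t in sent if t < current_try_number and t not in history_try_numbers
)
print(unrecorded)
```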
   
   ## Concrete example 2
   
   Dag: `bd-sourcery-odm-snapshot-daily-bk-eks`  
   Task: `create_snapshot.bkng_data`  
   Run: `scheduled__2026-04-23T01:00:00+00:00`
   
   Direct DB query from the scheduler container:
   
   ```text
   {'current_try_number': 20, 'current_state': 'success', 'current_start_date': 
'2026-04-24 12:31:41.179719+00:00', 'current_end_date': '2026-04-24 
14:14:41.020332+00:00', 'current_hostname': '10.94.27.178', 
'current_external_executor_id': '3b9419d0-e97f-489c-9cde-5ef072d99854'}
   history_rows=
   {'try_number': 3, 'state': 'failed', 'start_date': '2026-04-24 
02:22:26.799412+00:00', 'end_date': '2026-04-24 02:46:03.074407+00:00', 
'hostname': '10.94.30.159', 'external_executor_id': 
'5122ae59-e21c-4e69-bc3b-29bc2f55944c'}
   {'try_number': 6, 'state': 'failed', 'start_date': '2026-04-24 
08:32:26.583491+00:00', 'end_date': '2026-04-24 08:38:02.100532+00:00', 
'hostname': '10.94.23.156', 'external_executor_id': 
'd965e200-196f-46f2-b6f5-40de834202df'}
   {'try_number': 8, 'state': 'failed', 'start_date': '2026-04-24 
09:08:22.046598+00:00', 'end_date': '2026-04-24 09:12:15.401738+00:00', 
'hostname': '10.94.13.159', 'external_executor_id': 
'83efe05b-119c-4083-bf5b-3fa49f8f9e94'}
   {'try_number': 10, 'state': 'failed', 'start_date': '2026-04-24 
09:18:25.321445+00:00', 'end_date': '2026-04-24 09:18:30.991704+00:00', 
'hostname': '10.94.30.159', 'external_executor_id': 
'cb110363-675e-471d-9c9d-06cdf88ff6b6'}
   {'try_number': 11, 'state': 'failed', 'start_date': '2026-04-24 
09:23:45.550726+00:00', 'end_date': '2026-04-24 09:23:59.497683+00:00', 
'hostname': '10.94.30.159', 'external_executor_id': 
'136b1a2c-8a1d-4947-ab78-300d0c5e911a'}
   {'try_number': 14, 'state': 'failed', 'start_date': '2026-04-24 
10:23:08.636760+00:00', 'end_date': '2026-04-24 10:31:06.771334+00:00', 
'hostname': '10.94.17.127', 'external_executor_id': 
'35776857-c608-4a13-95ba-9d900daeaa6f'}
   {'try_number': 15, 'state': 'failed', 'start_date': '2026-04-24 
10:54:08.495904+00:00', 'end_date': '2026-04-24 11:06:53.790567+00:00', 
'hostname': '10.94.20.200', 'external_executor_id': 
'd3ccbaa3-3504-4b00-b248-3b51e750e25e'}
   {'try_number': 16, 'state': 'failed', 'start_date': '2026-04-24 
11:06:56.858252+00:00', 'end_date': '2026-04-24 11:15:49.369237+00:00', 
'hostname': '10.94.9.51', 'external_executor_id': 
'c065e193-c0fd-4081-a45e-09ead9bd613b'}
   {'try_number': 17, 'state': 'failed', 'start_date': '2026-04-24 
11:15:57.006413+00:00', 'end_date': '2026-04-24 11:52:47.887098+00:00', 
'hostname': '10.94.20.17', 'external_executor_id': 
'c2c0fe70-2f9f-4901-af1b-b22fc603bb67'}
   {'try_number': 18, 'state': 'failed', 'start_date': '2026-04-24 
11:52:54.897007+00:00', 'end_date': '2026-04-24 12:16:18.761605+00:00', 
'hostname': '10.94.22.79', 'external_executor_id': 
'4641c634-b572-4cbc-84ce-9d8b983210c3'}
   {'try_number': 19, 'state': 'failed', 'start_date': '2026-04-24 
12:16:21.797645+00:00', 'end_date': '2026-04-24 12:31:33.573200+00:00', 
'hostname': '10.94.27.178', 'external_executor_id': 
'f2a1b243-562d-4334-ae68-dc4054b5e8c9'}
   {'history_try_numbers': [3, 6, 8, 10, 11, 14, 15, 16, 17, 18, 19], 
'missing_try_numbers': [1, 2, 4, 5, 7, 9, 12, 13]}
   ```
   
   So this is not a one-off single-gap case. Here, a task that reached 
`try_number=20` is missing 8 of its 19 earlier attempts (tries 1, 2, 4, 5, 7, 9, 
12, and 13) from `task_instance_history`.
   
   ## Related issue
   
   This appears to belong to the same family of bugs as the issue below, 
although the executor and the exact symptom differ:
   
   - #65366
   
   That open issue reports retry-history loss symptoms on Airflow 3.1.x under 
`KubernetesExecutor`.
   
   I have not directly reproduced this on a running Airflow 3.2 deployment yet, 
so I do not want to overclaim version scope here. But the code path that 
snapshots a try into `task_instance_history` still appears materially similar in 
current 3.x code, so this may not be isolated to the 2.x line.
   

