hkc-8010 commented on issue #65011:
URL: https://github.com/apache/airflow/issues/65011#issuecomment-4259708460
Thanks @amoghrajesh — I checked the backend logs and source against the
heartbeat-timeout theory.
What I found:
- In source, the scheduler heartbeat-timeout path logs `Detected a task
instance without a heartbeat` and sends a callback with `xcom_keys_to_clear=[]`.
- I searched the scheduler log windows for both failing cases and found no
hits for any of:
  - `Detected a task instance without a heartbeat`
  - `without a heartbeat`
  - `heartbeat timeout`
For the non-deferrable repro `manual__2026-04-10T15:34:19.499833+00:00`,
backend worker logs show exactly one `Executing workload in Celery` entry for
each try:
- try 1 at `2026-04-10T15:34:20.517Z`
- try 2 at `2026-04-10T15:41:59.603Z`
Both were on the same worker host, and I did not find an extra hidden
earlier execution, revoke/adopt, or heartbeat-timeout signal in between.
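The "exactly one execution per try" check above can be expressed as a count of
`Executing workload in Celery` lines grouped by try number. A sketch under an
assumed log-line format (the `try_number=` field here is illustrative, not the
exact backend format):

```python
import re
from collections import Counter

# Hypothetical log shape: "<ts> ... Executing workload in Celery ... try_number=N"
LINE_RE = re.compile(r"Executing workload in Celery.*try_number=(?P<try>\d+)")

def executions_per_try(lines: list[str]) -> Counter:
    """Count Celery execution submissions per try number."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[int(match.group("try"))] += 1
    return counts
```

A healthy run should yield a count of 1 for every try; any count above 1 would
point at the hidden extra execution I could not find.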
So this weakens the theory that the scheduler killed a prior execution for a
missing heartbeat while the old worker was still alive and committing XComs. I
can’t prove that path is impossible, but I currently have no backend evidence
supporting it.
What still seems true:
- if `ti.next_method` is set, the execution API intentionally skips XCom
cleanup
- but that alone still does not explain the duplicate `return_value` on the
first resumed leg of try 1 after a successful trigger completion
- and it also does not explain the non-deferrable try-1
`glue_job_run_details` `409`
So the stronger remaining bucket still looks like a stale-row /
hidden-writer / XCom lifecycle gap outside the specific heartbeat-timeout path.
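To make the `ti.next_method` point concrete, here is a minimal toy model of
the behaviour described above. This is not the actual Airflow execution-API
code; `FakeTI` and `start_ti` are invented for illustration only:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FakeTI:
    """Toy stand-in for a task instance with an XCom store."""
    next_method: Optional[str] = None  # set when resuming from a trigger
    xcoms: dict = field(default_factory=dict)

def start_ti(ti: FakeTI) -> None:
    """Model of the start path: XCom cleanup is skipped on a deferred resume."""
    if ti.next_method is not None:
        # Resuming after a trigger fired: intentionally keep existing XComs,
        # so any stale `return_value` from an earlier leg survives.
        return
    # Fresh (non-resumed) start: clear old XComs before executing.
    ti.xcoms.clear()
```

This models why a resumed leg can see a leftover `return_value`, but it does
not cover the non-deferrable try-1 `409`, which is exactly why a lifecycle gap
outside this path still looks like the stronger bucket.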
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]