hkc-8010 commented on issue #65011:
URL: https://github.com/apache/airflow/issues/65011#issuecomment-4259708460
Thanks @amoghrajesh — I checked the backend logs and source against the
heartbeat-timeout theory.
What I found:
- In source, the scheduler heartbeat-timeout path logs `Detected a task
instance without a heartbeat` and sends a callback with `xcom_keys_to_clear=[]`.
- I searched the scheduler log windows for both failing cases and found no
hits for any of:
  - `Detected a task instance without a heartbeat`
  - `without a heartbeat`
  - `heartbeat timeout`
For the non-deferrable repro `manual__2026-04-10T15:34:19.499833+00:00`,
backend worker logs show exactly one `Executing workload in Celery` entry for
each try:
- try 1 at `2026-04-10T15:34:20.517Z`
- try 2 at `2026-04-10T15:41:59.603Z`
Both were on the same worker host, and I did not find an extra hidden
earlier execution, revoke/adopt, or heartbeat-timeout signal in between.
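The "exactly one execution per try" check above can be expressed as a count of
`Executing workload in Celery` lines grouped by try number. A sketch under an
assumed log-line format (the `try_number=` field here is illustrative, not the
exact backend format):

```python
import re
from collections import Counter

# Hypothetical log shape: "<ts> ... Executing workload in Celery ... try_number=N"
LINE_RE = re.compile(r"Executing workload in Celery.*try_number=(?P<try>\d+)")

def executions_per_try(lines: list[str]) -> Counter:
    """Count Celery execution submissions per try number."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[int(match.group("try"))] += 1
    return counts
```

A healthy run should yield a count of 1 for every try; any count above 1 would
point at the hidden extra execution I could not find.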
So this weakens the theory that the scheduler killed a prior execution for a
missing heartbeat while the old worker was still alive and committing XComs. I
can’t prove that path is impossible, but I currently have no backend evidence
supporting it.
What still seems true:
- if `ti.next_method` is set, the execution API intentionally skips XCom
cleanup
- but that alone still does not explain the duplicate `return_value` on the
first resumed leg of try 1 after a successful trigger completion
- and it also does not explain the non-deferrable try-1
`glue_job_run_details` `409`
So the stronger remaining bucket still looks like a stale-row /
hidden-writer / XCom lifecycle gap outside the specific heartbeat-timeout path.
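To make the `ti.next_method` point concrete, here is a minimal toy model of
the behaviour described above. This is not the actual Airflow execution-API
code; `FakeTI` and `start_ti` are invented for illustration only:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FakeTI:
    """Toy stand-in for a task instance with an XCom store."""
    next_method: Optional[str] = None  # set when resuming from a trigger
    xcoms: dict = field(default_factory=dict)

def start_ti(ti: FakeTI) -> None:
    """Model of the start path: XCom cleanup is skipped on a deferred resume."""
    if ti.next_method is not None:
        # Resuming after a trigger fired: intentionally keep existing XComs,
        # so any stale `return_value` from an earlier leg survives.
        return
    # Fresh (non-resumed) start: clear old XComs before executing.
    ti.xcoms.clear()
```

This models why a resumed leg can see a leftover `return_value`, but it does
not cover the non-deferrable try-1 `409`, which is exactly why a lifecycle gap
outside this path still looks like the stronger bucket.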
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]