hkc-8010 commented on PR #68008:
URL: https://github.com/apache/airflow/pull/68008#issuecomment-4715145642

   Thanks @jscheffl, good question.
   
   I think this is OK from the state-model side and does not need an extra code 
change. This makes the heartbeat-timeout path consistent with the existing 
scheduler path for externally killed tasks: the scheduler creates/sends the 
`TaskCallbackRequest`, then calls `ti.handle_failure(...)` to converge the 
metastore state.
   
   The callback itself is still executed asynchronously from the callback 
request payload. That payload is built before `handle_failure()` and uses the 
simplified execution API TI datamodel, so it is not passing a live ORM object 
whose state mutates underneath the callback. If callback code explicitly 
queries the metastore, it will now see the task as `failed`, which seems like 
the correct state once the scheduler has determined the task heartbeat timed 
out.
   
   The main issue this PR fixes is that leaving the DB row as `running` after 
queueing the callback lets the next scheduler heartbeat find the same TI again 
and enqueue another same-try failure callback. Moving through 
`handle_failure()` in the same purge pass closes that gap while keeping the 
callback asynchronous.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to