hkc-8010 commented on PR #68008: URL: https://github.com/apache/airflow/pull/68008#issuecomment-4715145642
Thanks @jscheffl, good question. I think this is OK from the state-model side and does not need an extra code change. This makes the heartbeat-timeout path consistent with the existing scheduler path for externally killed tasks: the scheduler creates/sends the `TaskCallbackRequest`, then calls `ti.handle_failure(...)` to converge the metastore state. The callback itself is still executed asynchronously from the callback request payload. That payload is built before `handle_failure()` and uses the simplified execution API TI datamodel, so it is not passing a live ORM object whose state mutates underneath the callback. If callback code explicitly queries the metastore, it will now see the task as `failed`, which seems like the correct state once the scheduler has determined the task heartbeat timed out. The main issue this PR fixes is that leaving the DB row as `running` after queueing the callback lets the next scheduler heartbeat find the same TI again and enqueue another same-try failure callback. Moving through `handle_failure()` in the same purge pass closes that gap while keeping the callback asynchronous. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
