hkc-8010 opened a new issue, #68010: URL: https://github.com/apache/airflow/issues/68010
A rescheduled sensor task can get stuck before entering the sensor `poke()` method when the worker restarts an attempt that already has `TaskReschedule` rows.

Observed behavior:

- The task attempt starts and emits task/listener startup logs.
- The sensor never emits its normal `ExternalTaskSensor` "Poking for Dag ..." log line.
- No external Dag run count request is made for that stuck restart.
- The task remains running until it is killed.
- A retry of the same task instance reaches `poke()` immediately and succeeds.

The source path that matches this behavior is `BaseSensorOperator.execute()` calling `RuntimeTaskInstance.get_first_reschedule_date()` before `poke()`. When `task_reschedule_count > 0`, the Task SDK currently asks the supervisor to fetch the first `TaskReschedule.start_date`, which adds a pre-poke supervisor/API round trip for rescheduled sensors.

Expected behavior:

A rescheduled sensor restart should not need an extra pre-poke supervisor/API request for metadata the API server already knows when it creates the task run context. The worker should receive the first reschedule start date in `TIRunContext` and use it directly, while preserving the existing supervisor request as a compatibility fallback.

Proposed fix:

Add `first_task_reschedule_start_date` to `TIRunContext`, populate it in the Execution API run response when the task instance has reschedules, and have the Task SDK use that value before falling back to `GetTaskRescheduleStartDate`. This avoids the pre-poke blocking point and keeps older API/server combinations compatible.
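
The lookup order in the proposed fix could look roughly like the sketch below. It is not Airflow code: `RunContextSketch` is a hypothetical stand-in for `TIRunContext`, `first_reschedule_date` is a hypothetical helper, and `fetch_from_supervisor` is an assumed callable standing in for the existing `GetTaskRescheduleStartDate` round trip. Only the field name `first_task_reschedule_start_date` and the counter `task_reschedule_count` come from the issue text.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class RunContextSketch:
    """Hypothetical stand-in for TIRunContext; not the real Airflow model."""

    task_reschedule_count: int = 0
    # Proposed field. None means the API server did not send it
    # (for example, an older server that predates the change).
    first_task_reschedule_start_date: Optional[datetime] = None


def first_reschedule_date(
    ctx: RunContextSketch,
    fetch_from_supervisor: Callable[[], Optional[datetime]],
) -> Optional[datetime]:
    """Return the first reschedule start date, preferring the run context."""
    if ctx.task_reschedule_count == 0:
        # Never rescheduled: nothing to look up, no round trip needed.
        return None
    if ctx.first_task_reschedule_start_date is not None:
        # New path: the value arrived with the run context.
        return ctx.first_task_reschedule_start_date
    # Compatibility fallback: ask the supervisor, as the current code does.
    return fetch_from_supervisor()


if __name__ == "__main__":
    started = datetime(2024, 1, 1, tzinfo=timezone.utc)
    ctx = RunContextSketch(
        task_reschedule_count=2,
        first_task_reschedule_start_date=started,
    )
    # The fallback callable is not invoked because the context has the value.
    print(first_reschedule_date(ctx, lambda: None))
```

With this ordering, a worker talking to a newer API server skips the pre-poke request entirely, while a worker talking to an older server behaves as it does today.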
