karakanb commented on issue #43948: URL: https://github.com/apache/airflow/issues/43948#issuecomment-2507302897
I looked into this a bit, and there seems to be a fundamental issue here. I'll try to explain below.

The expected behavior is a sensor that can run with retries in case something fails during the sensor check, e.g. infra issues. The retries are not about the sensor failing to find what it was looking for, e.g. "the task is not there", but about recovering from infra failures, e.g. the database being temporarily unavailable. This works as expected with sensors in general.

However, combining retries on sensors with timeouts is where things get interesting:

- When the user sets a timeout, the intention is "wait this long _from the beginning of the first try_", an important detail that is also highlighted in the [Timeouts section of the docs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#timeouts). This seems to work correctly in `reschedule` mode thanks to the `task_reschedule` table, which records the start timestamp of the first try.
- However, when deferrable mode is used, timeouts do not work with retries, since there's no way to retrieve the start time of the first attempt of a task instance.

It seems like the user would want the deferred and non-deferred versions of the sensor to behave the same way for timeouts with retries, but I couldn't find a way to solve it without adding a new table to Airflow. Is the original first start time saved somewhere?
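
To make the difference between the two reference points concrete, here is a minimal sketch in plain Python. It does not call any Airflow API: `has_timed_out` and the timeline values are hypothetical, and the comparison is only an assumption about how such a check could be expressed, not Airflow's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def has_timed_out(reference_start: datetime, timeout: timedelta, now: datetime) -> bool:
    """Return True once `timeout` has elapsed since `reference_start`.

    Hypothetical helper for illustration only, not an Airflow function.
    """
    return now - reference_start > timeout


# Hypothetical timeline: a sensor with a 1 hour timeout whose first try
# fails after 50 minutes because of an infra error and is then retried.
first_try_start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
retry_start = first_try_start + timedelta(minutes=50)
now = first_try_start + timedelta(minutes=70)
timeout = timedelta(hours=1)

# Documented intent: measure from the beginning of the first try.
timed_out_from_first_try = has_timed_out(first_try_start, timeout, now)

# If only the current attempt's start time is available, the clock
# effectively restarts on every retry.
timed_out_from_current_try = has_timed_out(retry_start, timeout, now)

print(timed_out_from_first_try)    # True: 70 minutes since the first try
print(timed_out_from_current_try)  # False: only 20 minutes since the retry
```

With the first-try start as the reference the sensor is already past its timeout, while with the retry start it would keep waiting, which is the mismatch described for deferrable mode.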