Beat-Nick commented on PR #68358:
URL: https://github.com/apache/airflow/pull/68358#issuecomment-4744193603
> I'd like to open a design discussion rather than request changes, mostly about how this positions itself relative to the other recovery mechanisms in play.

Thanks, digging into these shifted how I think this should be positioned.

Context on how I got here: the failures driving this were all transient (a ~5 min upstream-source outage, a library that failed to install, a Python kernel going unresponsive), and a job repair cleared each one. So that's what I reached for and what this PR builds. With fresh eyes, though, native task retries may be the better tool here, and the task group exposes neither mechanism today.

**Repair vs. native retries.** The two are complementary, split by cluster lifecycle. Native retries (`max_retries` / `min_retry_interval_millis`) re-run the failed task in-flight on the same cluster; `repair_run` acts on a terminal run, so it gets a fresh cluster and can re-run failed and dependent tasks. For the three failures above, retries are the better primary tool. So: retries as the first line, with `workflow_repair_attempts` as the run-level backstop for what retries can't reach (fresh-cluster recovery on a degraded driver/node).

**Interaction with Airflow `retries`.** If `retries` is set at the task level, it keeps working the way it does today, which is admittedly confusing: on a failure, an Airflow retry only re-runs the monitor, which finds nothing to repair and fails again.

**Path forward.** I'd like to split native retries into its own PR first and keep this one as a potential follow-up for the cases retries can't cover. Sound right, or would you rather both land together here with the positioning documented up front?

---
Drafted-by: Claude Code (Opus 4.8); reviewed by @BeatNick before posting

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
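To make the retries-versus-repair split in the comment concrete, here is a minimal sketch of the two request shapes involved. It is not code from the PR. The helper names (`task_with_native_retries`, `repair_payload`, `should_repair`) are hypothetical, the default values are illustrative, and the field names are assumed to match the Databricks Jobs API task and repair payloads.

```python
from typing import Any


def task_with_native_retries(
    task_key: str,
    notebook_path: str,
    max_retries: int = 2,
    min_retry_interval_millis: int = 60_000,
) -> dict[str, Any]:
    """Build a task spec that retries in-flight, on the same cluster.

    Hypothetical helper; the retry field names are the ones quoted in the comment.
    """
    return {
        "task_key": task_key,
        "notebook_task": {"notebook_path": notebook_path},
        "max_retries": max_retries,
        "min_retry_interval_millis": min_retry_interval_millis,
    }


def repair_payload(run_id: int) -> dict[str, Any]:
    """Build a repair request for a run that has already reached a terminal state.

    Assumed payload shape: re-run every failed task plus the tasks depending on them.
    """
    return {
        "run_id": run_id,
        "rerun_all_failed_tasks": True,
        "rerun_dependent_tasks": True,
    }


def should_repair(run_failed: bool, repairs_done: int, workflow_repair_attempts: int) -> bool:
    """Hypothetical backstop rule: repair only a failed terminal run, within budget."""
    return run_failed and repairs_done < workflow_repair_attempts
```

Under these assumptions, native retries live in the task spec and are handled while the run is still active, whereas the repair payload is only built after the run has failed, with `workflow_repair_attempts` capping how many times that happens.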
