moomindani commented on PR #68358: URL: https://github.com/apache/airflow/pull/68358#issuecomment-4747002622
This is a great write-up. The cluster-lifecycle split (native retries = in-flight, same cluster; repair = terminal run, fresh cluster + failed/dependent tasks) is exactly the right mental model, and it matches the failures you described.

I'd go with splitting: land native task retries (`max_retries` / `min_retry_interval_millis` on the task-group task spec) as its own PR first, and keep this repair PR as the scoped follow-up for what retries can't reach (fresh-cluster recovery on a degraded driver/node).

Reasoning:

- Native retries is the smaller, lower-risk change and covers the majority of the real cases you hit (transient source outage, flaky install, unresponsive kernel), so it delivers most of the value on its own and is easy to review.
- This repair PR is substantial (coordinator injection, sync/deferrable parity, the shared-deadline coordination). Landing it after retries exist lets it be scoped precisely to the fresh-cluster case and reviewed on its own merits, instead of carrying the "why not just retries?" question.

Two things worth folding into the native-retries PR while you're there:

- **Document the Airflow `retries` interaction.** As you noted, Airflow task-level `retries` in the task group just re-run the monitor (no-op against an already-terminal sub-run), so users should reach for the Databricks-side `max_retries` instead. Calling that out will save people the exact confusion you described.
- **A line on the retries-vs-repair boundary**: which failure classes retries cover vs. which actually need a fresh cluster, so the follow-up's scope is clear up front.

That's my read as a reviewer; @eladkal may have a preference on sequencing too.

---
Drafted-by: Claude Code (Opus 4.8); reviewed by @moomindani before posting
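For illustration only, here is a minimal sketch of what setting the Databricks-side retry fields on a task spec could look like. It assumes the task spec is a plain dict shaped like a Databricks Jobs task payload; the helper name `with_native_retries`, the default values, and the sample task are hypothetical and are not taken from the PR or from the provider code.

```python
from copy import deepcopy


def with_native_retries(task_spec, max_retries=2, min_retry_interval_millis=60_000):
    """Return a copy of a task spec with Databricks-side retry fields set.

    Hypothetical helper: values already present in the spec are kept,
    so explicit user settings are never overwritten.
    """
    spec = deepcopy(task_spec)  # leave the caller's dict untouched
    spec.setdefault("max_retries", max_retries)
    spec.setdefault("min_retry_interval_millis", min_retry_interval_millis)
    return spec


# Hypothetical task spec for one task in the task group.
task = {
    "task_key": "ingest",
    "notebook_task": {"notebook_path": "/Shared/ingest"},
}

retrying_task = with_native_retries(task)
explicit_task = with_native_retries({**task, "max_retries": 5})
```

With this shape, a transient failure would be retried by Databricks on the same cluster, while Airflow-level `retries` on the monitoring task stay out of the picture, which is the interaction the documentation note above is meant to spell out.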
