Beat-Nick commented on PR #68358: URL: https://github.com/apache/airflow/pull/68358#issuecomment-4720894208
Thanks for the review, good points all around. Here are the changes I've made:

1. **Grace-window asymmetry.** Both sides now share one wall-clock deadline. I dropped the `WORKFLOW_REPAIR_GRACE_POLLS` counter. The coordinator and the wait trigger both derive their deadline from `compute_repair_deadline()`, which is the parent run's terminal `end_time` plus a shared `workflow_repair_timeout`, so they expire together and the waiter can't fail the downstream task before the repair window elapses. I also lowered the default from 300s to 180s, since the repair API usually responds in milliseconds. Full details are in a new subsection of the PR description.

2. **Real-environment validation.** I've tested Airflow Dags against my Databricks workspace: sync, deferrable, max-repair threshold, and a job with mixed operator types. All behaved as expected. I'm happy to capture a specific run if useful.

3. **`start_time` assumption.** Rather than just commenting on the assumption, I made the code enforce it: the function now requires a populated `start_time`, so a not-yet-started attempt yields `None` and we keep polling instead of latching onto it. This matters most in the `original_start_time is None` case: without the requirement, that branch would match any non-original attempt, including one Databricks has accepted but not yet started. `start_time` is [set when the run is accepted](https://docs.databricks.com/api/workspace/jobs/getrun#start_time) (even while the cluster is still booting).
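For readers skimming the thread, points 1 and 3 can be sketched roughly as follows. This is an illustrative sketch, not the code in the PR: the signature of `compute_repair_deadline`, the epoch-millisecond units for `end_time` and `start_time`, and the helpers `repair_window_expired` and `populated_start_time` are all assumptions made for the example.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumed to mirror the new 180s default mentioned in point 1.
WORKFLOW_REPAIR_TIMEOUT = timedelta(seconds=180)


def compute_repair_deadline(
    parent_end_time_ms: int,
    workflow_repair_timeout: timedelta = WORKFLOW_REPAIR_TIMEOUT,
) -> datetime:
    """Deadline = parent run's terminal end_time + shared timeout.

    Assumes end_time is expressed in epoch milliseconds.
    """
    end_time = EPOCH + timedelta(milliseconds=parent_end_time_ms)
    return end_time + workflow_repair_timeout


def repair_window_expired(deadline: datetime, now: datetime) -> bool:
    """Hypothetical helper: both coordinator and waiter would use the same check."""
    return now >= deadline


def populated_start_time(attempt: dict) -> int | None:
    """Hypothetical helper: return start_time only if it is populated.

    A missing or zero start_time yields None, so the caller keeps polling
    instead of latching onto an attempt that has not started yet.
    """
    start_time = attempt.get("start_time")
    return start_time if start_time else None


# Both sides derive the same deadline from the same inputs.
parent_end_ms = 1_700_000_000_000
coordinator_deadline = compute_repair_deadline(parent_end_ms)
waiter_deadline = compute_repair_deadline(parent_end_ms)
```

Because the deadline depends only on the parent run's `end_time` and the shared timeout, neither side needs its own poll counter, and the two cannot drift apart by polling at different intervals.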
