Beat-Nick commented on PR #68358:
URL: https://github.com/apache/airflow/pull/68358#issuecomment-4720894208

   Thanks for the review, good points all around. Here are the changes I've 
made:
   
   1. **Grace-window asymmetry.** Both sides now share one wall-clock 
deadline. I removed the `WORKFLOW_REPAIR_GRACE_POLLS` counter. The coordinator 
and the wait trigger both derive their deadline from 
`compute_repair_deadline()`, which returns the parent run's terminal `end_time` 
plus a shared `workflow_repair_timeout`, so they expire together and the waiter 
can't fail the downstream task before the repair window elapses. I also lowered 
the default timeout from 300s to 180s, since the repair API usually responds in 
milliseconds. Full details are in a new subsection of the PR description.
   
   2. **Real-environment validation.** I've tested Airflow Dags against my 
Databricks workspace: sync, deferrable, max-repair threshold, and a job with 
mixed operator types. All behaved as expected. I'm happy to capture a specific 
run if useful.
   
   3. **`start_time` assumption.** Rather than just commenting on the 
assumption, I made the code enforce it: the function now requires a populated 
`start_time`, so a not-yet-started attempt yields `None` and we keep polling 
instead of latching onto it. This matters most when `original_start_time` is 
`None`: without the requirement, that branch would match any non-original 
attempt, including one Databricks has accepted but not yet started. 
`start_time` is [set when the run is 
accepted](https://docs.databricks.com/api/workspace/jobs/getrun#start_time) 
(even while the cluster is still booting).
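
   For readers skimming the thread, the shared-deadline idea from point 1 can 
be sketched roughly as below. This is an illustration, not the code in the PR: 
the signature of `compute_repair_deadline`, the helper 
`repair_window_expired`, and the constant name are assumptions, and it assumes 
`end_time` arrives in epoch milliseconds as the Databricks Jobs API reports it.

```python
from __future__ import annotations

import time

# Assumed constant name; 180s is the new default mentioned in point 1.
DEFAULT_WORKFLOW_REPAIR_TIMEOUT_S = 180.0


def compute_repair_deadline(
    parent_end_time_ms: int,
    workflow_repair_timeout_s: float = DEFAULT_WORKFLOW_REPAIR_TIMEOUT_S,
) -> float:
    """Return the wall-clock deadline (epoch seconds) for the repair window.

    Hypothetical signature: parent run's terminal end_time plus the timeout.
    """
    return parent_end_time_ms / 1000.0 + workflow_repair_timeout_s


def repair_window_expired(
    parent_end_time_ms: int,
    workflow_repair_timeout_s: float = DEFAULT_WORKFLOW_REPAIR_TIMEOUT_S,
    now: float | None = None,
) -> bool:
    """Hypothetical helper that both the coordinator and the waiter could call.

    Because both sides compute the deadline from the same two inputs, they
    reach the same answer no matter how often each one polls.
    """
    current = time.time() if now is None else now
    return current >= compute_repair_deadline(
        parent_end_time_ms, workflow_repair_timeout_s
    )
```

Deriving the deadline from the parent's `end_time` instead of counting polls 
means a slower poll interval on one side cannot shorten or lengthen its window 
relative to the other.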

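   The `start_time` guard from point 3 can be pictured with a similar sketch. 
Again this is hypothetical rather than the PR's implementation: the function 
name `pick_repair_attempt`, the dict shape, and the "started after the 
original" comparison are assumptions, and the candidate list is assumed to 
already exclude the original attempt.

```python
from __future__ import annotations

from typing import Any


def pick_repair_attempt(
    candidates: list[dict[str, Any]],
    original_start_time: int | None,
) -> dict[str, Any] | None:
    """Return the first candidate attempt that has actually started, else None.

    Hypothetical helper: `candidates` holds non-original attempts only.
    """
    for attempt in candidates:
        start_time = attempt.get("start_time")
        if not start_time:
            # No populated start_time: treat the attempt as not started yet
            # and skip it, so the caller keeps polling.
            continue
        if original_start_time is None:
            # Without the guard above, this branch would accept any candidate.
            return attempt
        if start_time > original_start_time:
            # Assumed rule: a repair attempt starts after the original one.
            return attempt
    return None
```

Returning `None` here is what keeps the caller polling instead of latching 
onto an attempt that has no `start_time` yet.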

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
