moomindani commented on code in PR #68358:
URL: https://github.com/apache/airflow/pull/68358#discussion_r3410463882


##########
providers/databricks/src/airflow/providers/databricks/triggers/databricks.py:
##########
@@ -21,11 +21,24 @@
 import time
 from typing import Any
 
-from airflow.providers.databricks.hooks.databricks import DatabricksHook
-from airflow.providers.databricks.utils.databricks import 
extract_failed_task_errors_async
+from airflow.providers.databricks.hooks.databricks import DatabricksHook, 
RunState
+from airflow.providers.databricks.utils.databricks import (
+    build_repair_run_json,
+    extract_failed_task_errors_async,
+    find_new_workflow_task_attempt,
+)
 from airflow.providers.databricks.utils.retry import 
validate_deferrable_databricks_retry_args
 from airflow.triggers.base import BaseTrigger, TriggerEvent
 
+# Tolerate this many consecutive polls of stale Databricks state in
+# ``DatabricksWorkflowRepairWaitTrigger``: when a sub-run reports terminal 
failure, a
+# repair-triggered new sub-run can take a moment to appear in the parent's 
tasks list — the
+# waiter polls up to this many times before declaring the parent terminally 
failed without a
+# new attempt. The coordinator uses a configurable wall-clock timeout instead 
(see
+# ``workflow_repair_timeout``), since the post-``repair_run`` 
eventual-consistency
+# window can stretch into minutes when Databricks is slow.
+WORKFLOW_REPAIR_GRACE_POLLS = 3

Review Comment:
   This is the crux of the waiter/coordinator timing contract. The waiter side 
(`DatabricksWorkflowRepairWaitTrigger` and 
`_sync_wait_for_new_sub_run_attempt`) gives up after 3 grace polls of "parent 
terminal + `repair_history` not grown" — 90s at the default 30s period. The 
coordinator, by contrast, tolerates Databricks-side eventual consistency for 
`workflow_repair_timeout` (300s) after `repair_run`. Since the coordinator and 
the waiter poll on independent timers, on a slow workspace the waiter can hit 
its 3-poll limit and emit `parent_failed` (failing the downstream task) before 
the coordinator's repair shows up in `repair_history` or a new attempt appears. 
The `repair_history`-growth reset only helps once growth is observed. Could you 
confirm the timing assumption holds, or align this tolerance with 
`workflow_repair_timeout` (a wall-clock bound) so both sides agree on how long 
Databricks is allowed to be slow?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to