moomindani commented on code in PR #68358: URL: https://github.com/apache/airflow/pull/68358#discussion_r3410463882
########## providers/databricks/src/airflow/providers/databricks/triggers/databricks.py: ########## @@ -21,11 +21,24 @@ import time from typing import Any -from airflow.providers.databricks.hooks.databricks import DatabricksHook -from airflow.providers.databricks.utils.databricks import extract_failed_task_errors_async +from airflow.providers.databricks.hooks.databricks import DatabricksHook, RunState +from airflow.providers.databricks.utils.databricks import ( + build_repair_run_json, + extract_failed_task_errors_async, + find_new_workflow_task_attempt, +) from airflow.providers.databricks.utils.retry import validate_deferrable_databricks_retry_args from airflow.triggers.base import BaseTrigger, TriggerEvent +# Tolerate this many consecutive polls of stale Databricks state in +# ``DatabricksWorkflowRepairWaitTrigger``: when a sub-run reports terminal failure, a +# repair-triggered new sub-run can take a moment to appear in the parent's tasks list — the +# waiter polls up to this many times before declaring the parent terminally failed without a +# new attempt. The coordinator uses a configurable wall-clock timeout instead (see +# ``workflow_repair_timeout``), since the post-``repair_run`` eventual-consistency +# window can stretch into minutes when Databricks is slow. +WORKFLOW_REPAIR_GRACE_POLLS = 3 Review Comment: This is the crux of the waiter/coordinator timing contract. The waiter side (`DatabricksWorkflowRepairWaitTrigger` and `_sync_wait_for_new_sub_run_attempt`) gives up after 3 grace polls of "parent terminal + `repair_history` not grown" — 90s at the default 30s period. The coordinator, by contrast, tolerates Databricks-side eventual consistency for `workflow_repair_timeout` (300s) after `repair_run`. Since the coordinator and the waiter poll on independent timers, on a slow workspace the waiter can hit its 3-poll limit and emit `parent_failed` (failing the downstream task) before the coordinator's repair shows up in `repair_history` or a new attempt appears. The `repair_history`-growth reset only helps once growth is observed. Could you confirm the timing assumption holds, or align this tolerance with `workflow_repair_timeout` (a wall-clock bound) so both sides agree on how long Databricks is allowed to be slow? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
