The GitHub Actions job "Tests (AMD)" on airflow.git/backport-173c2a1-v3-2-test 
has failed.
Run started by GitHub user vatsrahul1001 (triggered by vatsrahul1001).

Head commit for run:
65504bd1d4fac592bdfc1e3ddfd1d46f9ce8d957 / Jarek Potiuk <[email protected]>
Recover stuck TIs when direct terminal-state API call fails (#66574)

* Recover stuck TIs when direct terminal-state API call fails

The supervisor's _handle_request for SucceedTask, RetryTask, DeferTask,
and RescheduleTask set _terminal_state BEFORE calling the matching
client.task_instances.{succeed,retry,defer,reschedule}() API. If that
API call raised (transient network blip, server 5xx, etc.),
_terminal_state was set on the supervisor but the server never saw
the transition. The supervisor's update_task_state_if_needed then
saw final_state in STATES_SENT_DIRECTLY and short-circuited the
recovery finish() call -- leaving the TaskInstance stuck RUNNING
on the server forever, blocking downstream dependencies and
triggering false alerts.

Two-part fix:

1. Make the direct API call FIRST. Only set _terminal_state and the
   new _terminal_state_synced_to_server flag after the call returns
   successfully. If the API raises, both stay unset and the exception
   propagates to handle_requests, where the existing catch-all sends
   an ErrorResponse to the task subprocess.

2. Have update_task_state_if_needed always call finish() when
   _terminal_state_synced_to_server is False, regardless of what
   final_state happens to return. The finish() API takes the state
   value, so a SUCCESS / DEFERRED / etc. transition that originally
   failed is re-attempted via finish() on subprocess exit.
   Pre-existing semantics for the no-direct-API states (FAILED,
   UP_FOR_RETRY without RetryTask, etc.) are preserved -- those
   already land in the same finish() branch.
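
The two parts above can be sketched roughly as follows. This is a minimal
illustration, not the actual supervisor code: FakeClient, SupervisorSketch,
handle_succeed, exit_code, and run are hypothetical names, and deriving
final_state from the exit code is an assumption made only so the example is
self-contained. Only _terminal_state, _terminal_state_synced_to_server,
final_state, update_task_state_if_needed, succeed() and finish() come from
the description above.

```python
from typing import Optional


class FakeClient:
    """Hypothetical stand-in for the task-instance API client."""

    def __init__(self, fail_direct: bool) -> None:
        self.fail_direct = fail_direct
        self.calls: list = []

    def succeed(self) -> None:
        # Simulate a transient failure of the direct terminal-state call.
        if self.fail_direct:
            raise ConnectionError("transient network blip")
        self.calls.append(("succeed", "success"))

    def finish(self, state: str) -> None:
        self.calls.append(("finish", state))


class SupervisorSketch:
    """Hypothetical, heavily reduced model of the supervisor."""

    def __init__(self, client: FakeClient) -> None:
        self.client = client
        self._terminal_state: Optional[str] = None
        self._terminal_state_synced_to_server = False
        self.exit_code: Optional[int] = None  # assumption: set on subprocess exit

    def handle_succeed(self) -> None:
        # Part 1: make the API call FIRST; record state only if it returned.
        self.client.succeed()
        self._terminal_state = "success"
        self._terminal_state_synced_to_server = True

    @property
    def final_state(self) -> str:
        if self._terminal_state is not None:
            return self._terminal_state
        # Assumption for this sketch: fall back to the exit code.
        return "success" if self.exit_code == 0 else "failed"

    def update_task_state_if_needed(self) -> None:
        # Part 2: call finish() whenever the server was never told.
        if not self._terminal_state_synced_to_server:
            self.client.finish(self.final_state)


def run(fail_direct: bool) -> list:
    client = FakeClient(fail_direct)
    sup = SupervisorSketch(client)
    try:
        sup.handle_succeed()
    except ConnectionError:
        pass  # per the description, handle_requests sends an ErrorResponse here
    sup.exit_code = 0
    sup.update_task_state_if_needed()
    return client.calls
```

With fail_direct=True the only recorded call is the recovery finish() with
the success state; with fail_direct=False only the direct succeed() call is
recorded and finish() is skipped.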

Tests added:

- _terminal_state not set when succeed() raises.
- update_task_state_if_needed calls finish() when synced flag is
  False, even with final_state == SUCCESS.
- update_task_state_if_needed skips finish() when synced flag is
  True (preserves the existing happy-path optimisation).

Reported by the L3 ASVS sweep at apache/tooling-agents#24 (FINDING-007).

* Refactor terminal-state dispatch and parametrize tests across all 4 states

Address review feedback on #66574:

- Extract `_send_terminal_state_msg` helper so the per-msg-type dispatch
  for succeed / retry / defer / reschedule lives in one place. Both
  `_handle_request` and `_replay_pending_terminal_state_msg` now go
  through it instead of duplicating the four-branch isinstance chain.
- Parametrize the two recovery tests over all four terminal-state
  message types (was only Succeed + Defer); add UP_FOR_RETRY and
  UP_FOR_RESCHEDULE coverage.
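
A rough sketch of what a single dispatch helper of this kind might look
like. The message classes and RecordingClient below are simplified,
hypothetical stand-ins (the real messages carry fields and the real client
methods take arguments); only the four message names and the four client
method names are taken from the text above.

```python
class SucceedTask:
    pass


class RetryTask:
    pass


class DeferTask:
    pass


class RescheduleTask:
    pass


class RecordingClient:
    """Hypothetical client that records which method was called."""

    def __init__(self) -> None:
        self.calls: list = []

    def succeed(self) -> None:
        self.calls.append("succeed")

    def retry(self) -> None:
        self.calls.append("retry")

    def defer(self) -> None:
        self.calls.append("defer")

    def reschedule(self) -> None:
        self.calls.append("reschedule")


def send_terminal_state_msg(client: RecordingClient, msg: object) -> None:
    # The four-branch isinstance chain lives in exactly one place, so both
    # the normal request path and the replay path can share it.
    if isinstance(msg, SucceedTask):
        client.succeed()
    elif isinstance(msg, RetryTask):
        client.retry()
    elif isinstance(msg, DeferTask):
        client.defer()
    elif isinstance(msg, RescheduleTask):
        client.reschedule()
    else:
        raise TypeError(f"not a terminal-state message: {type(msg).__name__}")


def dispatch_all() -> list:
    # Table-driven check over all four message types, in the spirit of the
    # parametrized tests described above.
    client = RecordingClient()
    for msg in (SucceedTask(), RetryTask(), DeferTask(), RescheduleTask()):
        send_terminal_state_msg(client, msg)
    return client.calls
```

Driving the helper from one table of message types is what makes it cheap
to cover all four states instead of only two.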

* Narrow _pending_terminal_state_msg type to satisfy mypy

The field was annotated as BaseModel | None, but _send_terminal_state_msg
expects SucceedTask | RetryTask | DeferTask | RescheduleTask. mypy
couldn't prove the narrowing at the _replay_pending_terminal_state_msg
call site. Tighten the field type to the exact union the setter assigns
and the consumer accepts.
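
The typing issue can be illustrated with a small sketch. The classes here
are plain hypothetical stand-ins (no pydantic), Base plays the role of
BaseModel, and TerminalStateMsg, Holder, send and replay are illustrative
names not taken from the commit. The point is only that a field annotated
with the exact union passes type checking at the call site, whereas a field
annotated with the broad base type would not.

```python
from typing import Optional, Union


class Base:
    """Hypothetical stand-in for the broad base model type."""


class SucceedTask(Base):
    pass


class RetryTask(Base):
    pass


class DeferTask(Base):
    pass


class RescheduleTask(Base):
    pass


# The exact union that the setter assigns and the consumer accepts.
TerminalStateMsg = Union[SucceedTask, RetryTask, DeferTask, RescheduleTask]


def send(msg: TerminalStateMsg) -> str:
    # Consumer: only accepts one of the four terminal-state messages.
    return type(msg).__name__


class Holder:
    def __init__(self) -> None:
        # Annotating this as Optional[Base] would make the send() call in
        # replay() fail type checking, since Base is wider than the union.
        self._pending_terminal_state_msg: Optional[TerminalStateMsg] = None

    def replay(self) -> Optional[str]:
        msg = self._pending_terminal_state_msg
        if msg is None:
            return None
        # After the None check, msg has exactly the type send() expects.
        return send(msg)
```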

---------

Co-authored-by: vatsrahul1001 <[email protected]>
Co-authored-by: Rahul Vats <[email protected]>
(cherry picked from commit 173c2a1806dd087272ec287fb923917630ef8f81)

Report URL: https://github.com/apache/airflow/actions/runs/26120400424

With regards,
GitHub Actions via GitBox


