github-actions[bot] opened a new pull request, #67183:
URL: https://github.com/apache/airflow/pull/67183
* Fail closed when supervisor IPC fails on a non-success terminal state
When a task FAILED / SKIPPED / etc. and the IPC send of the
terminal-state message to the supervisor itself raised, the existing
finally block logged the exception and let run() return normally.
The task subprocess then exited with code 0, which the supervisor
final_state property maps to SUCCESS for an exit_code-0 process
without a _terminal_state (the supervisor never received the
message). A genuine task FAILURE was silently being upgraded to
SUCCESS on transient IPC failures, breaking downstream pipeline
correctness without any signal.
Exit non-zero from the finally block when the terminal state is
anything other than SUCCESS, so the supervisor's final_state
correctly classifies as FAILED (or UP_FOR_RETRY when retries are
configured). The SUCCESS exemption preserves the existing softening
for the legitimate scenario where the supervisor rejects the
terminal-state send with a 409 because the server already
terminalised the TI -- covered by
test_run_swallows_supervisor_terminal_send_failure, which continues
to pass.
New regression test: when the task fails and the supervisor IPC send
raises (BrokenPipeError simulating a dead Unix socket), run() now
raises SystemExit(1).
Reported by the L3 ASVS sweep at apache/tooling-agents#24 (FINDING-006).
* Address review comments: defer fail-closed exit, narrow guard
- Narrow the fail-closed guard from `state != SUCCESS` to FAILED /
UP_FOR_RETRY only (kaxil). SKIPPED / UP_FOR_RESCHEDULE / DEFERRED
would otherwise be mismapped to FAILED by supervisor's final_state,
which is strictly worse than the default mapping.
- Stop calling sys.exit(1) inside run()'s finally — SystemExit is a
BaseException so main()'s `except Exception:` would not catch it
and finalize() would be skipped, silently dropping
on_failure_callback / on_retry_callback / listener hooks /
email_on_failure / email_on_retry on the same IPC failure (kaxil).
Signal via `_terminal_state_send_failed` on the ti and let main()
sys.exit(1) after finalize() has run.
- Remove the redundant inline `from ... import run` in the test
(Lee-W, kaxil) — `run` is already imported at module level.
- Rework the existing regression test to assert the new contract
(run() returns FAILED + sets the flag) and add a listener-based
test that locks in callbacks/listeners still firing on the
IPC-broken path (kaxil).
* Declare _terminal_state_send_failed on RuntimeTaskInstance for mypy
The fail-closed path in _handle_current_task_failure set
ti._terminal_state_send_failed = True
dynamically without a class-level declaration, so mypy raised [attr-defined].
Add the field as a Pydantic PrivateAttr (bool, default False) matching the
existing _cached_template_context pattern. No runtime behavior change —
getattr(ti, '_terminal_state_send_failed', False) still returns False for
the unset case, now via the PrivateAttr default instead of the getattr
fallback.
---------
(cherry picked from commit 7e91517ffd01f0608f21349ad8aa86595617d220)
Co-authored-by: Jarek Potiuk <[email protected]>
Co-authored-by: vatsrahul1001 <[email protected]>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]