The GitHub Actions job "Tests (ARM)" on airflow.git/main has failed. Run started by GitHub user potiuk (triggered by potiuk).
Head commit for run: 7e91517ffd01f0608f21349ad8aa86595617d220 / Jarek Potiuk <[email protected]> Fail closed when supervisor IPC fails on a non-success terminal state (#66573) * Fail closed when supervisor IPC fails on a non-success terminal state When a task FAILED / SKIPPED / etc. and the IPC send of the terminal-state message to the supervisor itself raised, the existing finally block logged the exception and let run() return normally. The task subprocess then exited with code 0, which the supervisor final_state property maps to SUCCESS for an exit_code-0 process without a _terminal_state (the supervisor never received the message). A genuine task FAILURE was silently being upgraded to SUCCESS on transient IPC failures, breaking downstream pipeline correctness without any signal. Exit non-zero from the finally block when the terminal state is anything other than SUCCESS, so the supervisor's final_state correctly classifies as FAILED (or UP_FOR_RETRY when retries are configured). The SUCCESS exemption preserves the existing softening for the legitimate scenario where the supervisor rejects the terminal-state send with a 409 because the server already terminalised the TI -- covered by test_run_swallows_supervisor_terminal_send_failure, which continues to pass. New regression test: when the task fails and the supervisor IPC send raises (BrokenPipeError simulating a dead Unix socket), run() now raises SystemExit(1). Reported by the L3 ASVS sweep at apache/tooling-agents#24 (FINDING-006). * Address review comments: defer fail-closed exit, narrow guard - Narrow the fail-closed guard from `state != SUCCESS` to FAILED / UP_FOR_RETRY only (kaxil). SKIPPED / UP_FOR_RESCHEDULE / DEFERRED would otherwise be mismapped to FAILED by supervisor's final_state, which is strictly worse than the default mapping. - Stop calling sys.exit(1) inside run()'s finally — SystemExit is a BaseException so main()'s `except Exception:` would not catch it and finalize() would be skipped, silently dropping on_failure_callback / on_retry_callback / listener hooks / email_on_failure / email_on_retry on the same IPC failure (kaxil). Signal via `_terminal_state_send_failed` on the ti and let main() sys.exit(1) after finalize() has run. - Remove the redundant inline `from ... import run` in the test (Lee-W, kaxil) — `run` is already imported at module level. - Rework the existing regression test to assert the new contract (run() returns FAILED + sets the flag) and add a listener-based test that locks in callbacks/listeners still firing on the IPC-broken path (kaxil). * Declare _terminal_state_send_failed on RuntimeTaskInstance for mypy The fail-closed path in _handle_current_task_failure set ti._terminal_state_send_failed = True dynamically without a class-level declaration, so mypy raised [attr-defined]. Add the field as a Pydantic PrivateAttr (bool, default False) matching the existing _cached_template_context pattern. No runtime behavior change — getattr(ti, '_terminal_state_send_failed', False) still returns False for the unset case, now via the PrivateAttr default instead of the getattr fallback. --------- Co-authored-by: vatsrahul1001 <[email protected]> Report URL: https://github.com/apache/airflow/actions/runs/26103253278 With regards, GitHub Actions via GitBox --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
