github-actions[bot] opened a new pull request, #67183:
URL: https://github.com/apache/airflow/pull/67183

   * Fail closed when supervisor IPC fails on a non-success terminal state
   
   When a task FAILED / SKIPPED / etc. and the IPC send of the
   terminal-state message to the supervisor itself raised, the existing
   finally block logged the exception and let run() return normally.
   The task subprocess then exited with code 0, which the supervisor
   final_state property maps to SUCCESS for an exit_code-0 process
   without a _terminal_state (the supervisor never received the
   message). A genuine task FAILURE was silently being upgraded to
   SUCCESS on transient IPC failures, breaking downstream pipeline
   correctness without any signal.
   
   Exit non-zero from the finally block when the terminal state is
   anything other than SUCCESS, so the supervisor's final_state
   correctly classifies as FAILED (or UP_FOR_RETRY when retries are
   configured). The SUCCESS exemption preserves the existing softening
   for the legitimate scenario where the supervisor rejects the
   terminal-state send with a 409 because the server already
   terminalised the TI -- covered by
   test_run_swallows_supervisor_terminal_send_failure, which continues
   to pass.
   
   New regression test: when the task fails and the supervisor IPC send
   raises (BrokenPipeError simulating a dead Unix socket), run() now
   raises SystemExit(1).
   
   Reported by the L3 ASVS sweep at apache/tooling-agents#24 (FINDING-006).
   
   * Address review comments: defer fail-closed exit, narrow guard
   
   - Narrow the fail-closed guard from `state != SUCCESS` to FAILED /
     UP_FOR_RETRY only (kaxil). SKIPPED / UP_FOR_RESCHEDULE / DEFERRED
     would otherwise be mismapped to FAILED by supervisor's final_state,
     which is strictly worse than the default mapping.
   
   - Stop calling sys.exit(1) inside run()'s finally — SystemExit is a
     BaseException so main()'s `except Exception:` would not catch it
     and finalize() would be skipped, silently dropping
     on_failure_callback / on_retry_callback / listener hooks /
     email_on_failure / email_on_retry on the same IPC failure (kaxil).
     Signal via `_terminal_state_send_failed` on the ti and let main()
     sys.exit(1) after finalize() has run.
   
   - Remove the redundant inline `from ... import run` in the test
     (Lee-W, kaxil) — `run` is already imported at module level.
   
   - Rework the existing regression test to assert the new contract
     (run() returns FAILED + sets the flag) and add a listener-based
     test that locks in callbacks/listeners still firing on the
     IPC-broken path (kaxil).
   
   * Declare _terminal_state_send_failed on RuntimeTaskInstance for mypy
   
   The fail-closed path in _handle_current_task_failure set 
ti._terminal_state_send_failed = True
   dynamically without a class-level declaration, so mypy raised [attr-defined].
   Add the field as a Pydantic PrivateAttr (bool, default False) matching the
   existing _cached_template_context pattern. No runtime behavior change —
   getattr(ti, '_terminal_state_send_failed', False) still returns False for
   the unset case, now via the PrivateAttr default instead of the getattr 
fallback.
   
   ---------
   (cherry picked from commit 7e91517ffd01f0608f21349ad8aa86595617d220)
   
   Co-authored-by: Jarek Potiuk <[email protected]>
   Co-authored-by: vatsrahul1001 <[email protected]>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to