potiuk commented on code in PR #66573:
URL: https://github.com/apache/airflow/pull/66573#discussion_r3264979900
##########
task-sdk/src/airflow/sdk/execution_time/task_runner.py:
##########
@@ -1426,6 +1426,25 @@ def _on_term(signum, frame):
"Failed to report terminal task state to supervisor",
state=state.value,
)
+ # Fail closed for non-success terminal states: when the
+ # supervisor never receives the terminal-state message,
+ # exiting 0 would let the supervisor's final_state property
+ # default to SUCCESS (exit_code == 0 with no _terminal_state
+ # set). For a task that actually FAILED / was SKIPPED / etc.,
+ # that turns an IPC blip into silent data-quality breakage
+ # for every downstream task. Exit non-zero so the
+ # supervisor's final_state correctly classifies this as
+ # FAILED (or UP_FOR_RETRY when retries are configured).
+ #
+ # SUCCESS is exempt: a "send the SUCCESS marker, supervisor
+ # rejects with 409 because the server already terminalised
+ # this TI" path is the legitimate scenario the existing
+ # softening targets. In that path the local state is SUCCESS
+ # and the supervisor's default-to-SUCCESS coincides with
+ # reality, so we continue to finalize() so listeners observe
+ # the task state.
+ if state != TaskInstanceState.SUCCESS:
+ sys.exit(1)
Review Comment:
Good push-back — re-thought this with kaxil's input. The original
`sys.exit(1)` was wrong in two ways (raises `SystemExit`, which `main()`'s
`except Exception:` doesn't catch, so `finalize()` would be skipped and
callbacks/listeners/emails silently dropped on the same IPC failure). I've
reworked it in 1b2449e:
- `run()` no longer exits. It sets `_terminal_state_send_failed` on the ti.
- `main()` calls `sys.exit(1)` only after `finalize()` returns, so
listeners/callbacks observe the FAILED state first.
- Guard narrowed to FAILED / UP_FOR_RETRY (per kaxil) — SKIPPED /
UP_FOR_RESCHEDULE / DEFERRED would otherwise be misclassified as FAILED by the
supervisor.
On your other point: without this signal, `finalize()` doesn't itself
surface the IPC failure to `main()` — the `_xcom_push_to_db` and
`SetRenderedFields` paths through `SUPERVISOR_COMMS` are both already wrapped
in `try/except`, and the callback/listener/email code paths run in-process and
don't touch the supervisor socket — so `main()` returns normally with exit 0,
and the supervisor's `final_state` defaults to SUCCESS. That's the
silent-correctness bug this PR fixes.
---
Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]