jykae opened a new pull request, #67333:
URL: https://github.com/apache/airflow/pull/67333
Fix monitoring-pod leak in ``KubernetesJobOperator``.
``KubernetesJobOperator`` inherits from ``KubernetesPodOperator`` but
overrode ``execute()`` without ever invoking the parent's pod-cleanup path,
so the "monitoring" pods discovered via ``get_pods()`` (used to stream
logs and XCom while the Job runs) were never deleted. These pods are
created by Airflow, not by the ``V1Job`` controller, so they have no
``ownerReferences`` — neither ``ttl_seconds_after_finished`` nor the
foreground cascade on ``on_kill()`` reaped them. Every task run leaked
one pod per Job.
This PR makes pod cleanup symmetric with ``KubernetesPodOperator`` and
honours ``on_finish_action`` / ``on_kill_action`` for the discovered pods.
### Changes
**``operators/job.py``**
* ``execute()`` and ``execute_complete()`` now wrap their work in
``try/finally`` and call ``post_complete_action()`` for every pod
returned by ``get_pods()``. The inherited ``on_finish_action``
(``delete_pod`` / ``delete_succeeded_pod`` / ``delete_active_pod`` /
``keep_pod``) is now respected, matching ``KubernetesPodOperator``
semantics.
* ``on_kill()`` additionally calls ``pod_manager.delete_pod()`` for each
monitoring pod, gated by ``on_kill_action``. The Job's foreground
cascade does not reach these pods because they have no
``ownerReferences``. Unexpected ``ApiException``s are logged instead
of silently suppressed.
* ``execute_complete()`` resolves monitoring pods once and shares the
lookup between the log-retrieval and cleanup paths. Resolution is
best-effort — failures in the deferrable resume path no longer break
cleanup.
* Per-pod cleanup errors are logged but never mask the in-flight
exception, so Job-level failures continue to propagate unchanged.
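
The cleanup flow described in the bullets above can be illustrated with a minimal, standalone sketch. It is not the provider's code: ``run_with_cleanup``, ``should_delete`` and the callables passed in are hypothetical names, and the mapping from each ``on_finish_action`` value to a pod phase is an assumption based on the value names.

```python
import logging
from enum import Enum

log = logging.getLogger(__name__)


class OnFinishAction(str, Enum):
    # Values as listed in this PR description.
    DELETE_POD = "delete_pod"
    DELETE_SUCCEEDED_POD = "delete_succeeded_pod"
    DELETE_ACTIVE_POD = "delete_active_pod"
    KEEP_POD = "keep_pod"


def should_delete(action: OnFinishAction, phase: str) -> bool:
    """Decide whether a pod in the given phase is deleted (assumed semantics)."""
    if action is OnFinishAction.DELETE_POD:
        return True
    if action is OnFinishAction.DELETE_SUCCEEDED_POD:
        return phase == "Succeeded"
    if action is OnFinishAction.DELETE_ACTIVE_POD:
        # Assumption: "active" means the pod has not reached a terminal phase.
        return phase not in ("Succeeded", "Failed")
    return False  # KEEP_POD


def run_with_cleanup(run_job, get_pods, delete_pod, action):
    """Run the job, then clean up monitoring pods in a finally block.

    get_pods() yields (name, phase) tuples; delete_pod(name) removes one pod.
    Both are hypothetical stand-ins for the operator's real helpers.
    """
    try:
        return run_job()
    finally:
        try:
            pods = list(get_pods())
        except Exception:
            # Best-effort lookup: a failed lookup must not mask the job result.
            log.exception("Could not resolve monitoring pods")
            pods = []
        for name, phase in pods:
            if not should_delete(action, phase):
                continue
            try:
                delete_pod(name)
            except Exception:
                # Log and continue so the in-flight exception, if any, propagates.
                log.exception("Failed to clean up pod %s", name)
```

Because every cleanup error is caught inside the ``finally`` block, an exception raised by ``run_job()`` still reaches the caller unchanged.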
**``triggers/job.py``**
* The trigger event now always includes ``pod_names`` /
``pod_namespace``, regardless of ``get_logs``. This guarantees
``execute_complete()`` can reliably clean up monitoring pods even
when log streaming is disabled.
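
As a rough illustration of why the unconditional fields matter, the sketch below builds an event payload and shows a consumer that relies on it. Only ``pod_names`` and ``pod_namespace`` come from this description; the function names and the other keys are hypothetical.

```python
def build_trigger_event(status, job_name, namespace, pod_names, pod_namespace, get_logs):
    """Build the event payload; the pod fields are set whatever get_logs is."""
    event = {"status": status, "name": job_name, "namespace": namespace}
    # get_logs is intentionally not consulted here: cleanup needs the pod
    # identity even when log streaming is disabled.
    event["pod_names"] = list(pod_names)
    event["pod_namespace"] = pod_namespace
    return event


def pods_to_clean(event):
    """Return (namespace, name) pairs that the resume path should clean up."""
    return [(event["pod_namespace"], name) for name in event["pod_names"]]
```

With this shape, the resume path never has to guess pod names when ``get_logs=False``.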
**``docs/operators.rst``**
New section documenting the cleanup contract: which pods are affected,
the meaning of each ``on_finish_action`` value for monitoring pods, and
the ``on_kill_action`` behaviour.
**Tests**
* Coverage for each ``on_finish_action`` value (``delete_pod``,
``delete_succeeded_pod``, ``delete_active_pod``, ``keep_pod``) on
both success and failure paths.
* Coverage for ``on_kill_action`` (``delete_pod`` / ``keep_pod``).
* Regression test for the deferrable ``get_logs=False`` path.
* New mocks use ``spec`` / ``autospec`` to catch attribute typos
against the real ``kubernetes`` client surface.
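
The benefit of ``spec`` / ``autospec`` can be seen in a small standalone example. ``PodManagerLike`` is a hypothetical stand-in class, not the real pod manager; ``unittest.mock.create_autospec`` is the standard-library call.

```python
from unittest import mock


class PodManagerLike:
    """Hypothetical stand-in for a class with a delete_pod method."""

    def delete_pod(self, pod):
        return None


# A plain MagicMock accepts any attribute, so a typo goes unnoticed.
loose = mock.MagicMock()
loose.delete_pods("pod-a")  # misspelled method name, silently accepted

# An autospecced mock only exposes attributes that exist on the real class.
strict = mock.create_autospec(PodManagerLike, instance=True)
strict.delete_pod("pod-a")  # valid call, recorded as usual

try:
    strict.delete_pods("pod-a")  # misspelled method name
    typo_caught = False
except AttributeError:
    typo_caught = True
```

An autospecced mock also checks call signatures, so calling ``strict.delete_pod()`` without its argument raises ``TypeError``.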
### Backwards compatibility
Default ``on_finish_action`` is unchanged (``delete_pod``), so existing
deployments will start deleting monitoring pods automatically after
upgrading. Users who relied on monitoring pods surviving the task
(e.g. for offline log inspection) can restore that behaviour by passing
``on_finish_action="keep_pod"``.
### How to verify
1. Run a DAG using ``KubernetesJobOperator`` with default settings.
2. After the task finishes, both the ``V1Job``'s child pod and the
monitoring pod (label ``airflow_kpo_in_cluster=True``, no
``ownerReferences``) should be gone.
3. Repeat with ``on_finish_action="delete_succeeded_pod"`` and a
failing command — the monitoring pod should remain for forensics.
4. Repeat with ``on_finish_action="keep_pod"`` — both pods should
remain.
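
To make the check in step 2 less manual, a small helper can list pods that have no ``ownerReferences`` from the JSON printed by ``kubectl get pods -o json``. This is a sketch only: ``pods_without_owner`` is a hypothetical name, and it assumes the usual ``items[].metadata`` layout of that output.

```python
import json


def pods_without_owner(pod_list_json: str) -> list[str]:
    """Return names of pods whose metadata has no ownerReferences."""
    items = json.loads(pod_list_json).get("items", [])
    return [
        pod["metadata"]["name"]
        for pod in items
        # A missing key and an empty list both count as "no owner".
        if not pod["metadata"].get("ownerReferences")
    ]
```

Feeding it the JSON for the task's namespace after a run with default settings should return an empty list if the monitoring pod was cleaned up.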
---
##### Was generative AI tooling used to co-author this PR?
- [X] Yes — GitHub Copilot (Claude Opus 4.7)
Generated-by: GitHub Copilot (Claude Opus 4.7) following [the
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)