jykae opened a new pull request, #67333:
URL: https://github.com/apache/airflow/pull/67333

   Fix monitoring-pod leak in ``KubernetesJobOperator``.
   
   ``KubernetesJobOperator`` inherits from ``KubernetesPodOperator``, but its
   ``execute()`` override never invoked the parent's pod-cleanup path,
   so the "monitoring" pods discovered via ``get_pods()`` (used to stream
   logs and XCom while the Job runs) were never deleted. These pods are
   created by Airflow, not by the ``V1Job`` controller, so they have no
   ``ownerReferences``, and neither ``ttl_seconds_after_finished`` nor the
   foreground cascade on ``on_kill()`` reaped them. Every task run leaked
   one pod per Job.
   
   This PR makes pod cleanup symmetric with ``KubernetesPodOperator`` and
   honours ``on_finish_action`` / ``on_kill_action`` for the discovered pods.
   
   ### Changes
   
   **``operators/job.py``**
   
   * ``execute()`` and ``execute_complete()`` now wrap their work in
     ``try/finally`` and call ``post_complete_action()`` for every pod
     returned by ``get_pods()``. The inherited ``on_finish_action``
     (``delete_pod`` / ``delete_succeeded_pod`` / ``delete_active_pod`` /
     ``keep_pod``) is now respected, matching ``KubernetesPodOperator``
     semantics.
   * ``on_kill()`` additionally calls ``pod_manager.delete_pod()`` for each
     monitoring pod, gated by ``on_kill_action``. The Job's foreground
     cascade does not reach these pods because they have no
     ``ownerReferences``. Unexpected ``ApiException``s are logged instead
     of silently suppressed.
   * ``execute_complete()`` resolves monitoring pods once and shares the
     lookup between the log-retrieval and cleanup paths. Resolution is
     best-effort — failures in the deferrable resume path no longer break
     cleanup.
   * Per-pod cleanup errors are logged but never mask the in-flight
     exception, so Job-level failures continue to propagate unchanged.
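
The cleanup flow described in the bullets above can be sketched roughly as follows. This is a simplified illustration in plain Python, not the operator's actual code: ``run_with_cleanup``, ``run_job``, ``find_pods`` and ``cleanup_pod`` are hypothetical stand-ins for the Job execution, ``get_pods()`` and ``post_complete_action()``.

```python
import logging

log = logging.getLogger(__name__)


def run_with_cleanup(run_job, find_pods, cleanup_pod):
    """Run the job, then clean up every discovered pod, even on failure.

    All three callables are hypothetical stand-ins used for illustration.
    """
    try:
        return run_job()
    finally:
        try:
            pods = find_pods()
        except Exception:
            # Best-effort resolution: a failed lookup is logged and must not
            # replace the exception (if any) raised by run_job().
            log.exception("Could not resolve monitoring pods")
            pods = []
        for pod in pods:
            try:
                cleanup_pod(pod)
            except Exception:
                # Log and continue so one bad pod does not block the rest.
                log.exception("Cleanup failed for pod %s", pod)
```

Because every cleanup error is caught inside the ``finally`` block, an exception raised by ``run_job()`` reaches the caller unchanged.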
   
   **``triggers/job.py``**
   
   * The trigger event now always includes ``pod_names`` /
     ``pod_namespace``, regardless of ``get_logs``. This guarantees
     ``execute_complete()`` can reliably clean up monitoring pods even
     when log streaming is disabled.
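
As a rough illustration of why the unconditional fields matter, the sketch below builds an event and derives the cleanup targets from it. Only ``pod_names`` and ``pod_namespace`` come from this PR; the ``status`` key and both function names are assumptions made for the example.

```python
def build_event(status, pod_names, pod_namespace):
    """Build a hypothetical trigger event payload.

    The pod fields are always present, whether or not logs are streamed.
    """
    return {
        "status": status,  # assumed key, for illustration only
        "pod_names": list(pod_names),
        "pod_namespace": pod_namespace,
    }


def pods_to_clean(event):
    """Return (namespace, name) pairs that the resume path should clean up."""
    return [(event["pod_namespace"], name) for name in event["pod_names"]]
```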
   
   **``docs/operators.rst``**
   
   New section documenting the cleanup contract: which pods are affected,
   the meaning of each ``on_finish_action`` value for monitoring pods, and
   the ``on_kill_action`` behaviour.
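
One way to read the four ``on_finish_action`` values is as a small decision function over the pod phase. The sketch below is an interpretation, not code from the provider: ``should_delete`` is a hypothetical helper, and treating ``delete_active_pod`` as "delete only pods still in ``Pending`` or ``Running``" is an assumption.

```python
def should_delete(on_finish_action: str, pod_phase: str) -> bool:
    """Decide whether a monitoring pod is deleted once the task finishes."""
    if on_finish_action == "delete_pod":
        return True
    if on_finish_action == "delete_succeeded_pod":
        # Failed pods are kept so they can be inspected afterwards.
        return pod_phase == "Succeeded"
    if on_finish_action == "delete_active_pod":
        # Assumption: "active" means the pod has not reached a terminal phase.
        return pod_phase in ("Pending", "Running")
    if on_finish_action == "keep_pod":
        return False
    raise ValueError(f"Unknown on_finish_action: {on_finish_action!r}")
```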
   
   **Tests**
   
   * Coverage for each ``on_finish_action`` value (``delete_pod``,
     ``delete_succeeded_pod``, ``delete_active_pod``, ``keep_pod``) on
     both success and failure paths.
   * Coverage for ``on_kill_action`` (``delete_pod`` / ``keep_pod``).
   * Regression test for the deferrable ``get_logs=False`` path.
   * New mocks use ``spec`` / ``autospec`` to catch attribute typos
     against the real ``kubernetes`` client surface.
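
The value of ``spec`` / ``autospec`` can be seen in a minimal sketch. ``FakePodManager`` is a hypothetical stand-in for the real client class, and ``delete_pdo`` is a deliberate typo of ``delete_pod``.

```python
from unittest import mock


class FakePodManager:
    """Hypothetical stand-in for the class being mocked."""

    def delete_pod(self, pod):
        raise NotImplementedError


def typo_is_caught(manager) -> bool:
    """Return True if calling a misspelled method raises AttributeError."""
    try:
        manager.delete_pdo("my-pod")  # deliberate typo of delete_pod
    except AttributeError:
        return True
    return False


# A bare Mock accepts any attribute, so the typo passes silently.
loose = mock.Mock()
# An autospecced mock only exposes attributes that exist on the spec class.
strict = mock.create_autospec(FakePodManager, instance=True)
```

With these definitions, ``typo_is_caught(loose)`` returns ``False`` while ``typo_is_caught(strict)`` returns ``True``, which is the failure mode the new mocks are meant to surface.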
   
   ### Backwards compatibility
   
   The default ``on_finish_action`` is unchanged (``delete_pod``), so existing
   deployments will start deleting monitoring pods automatically instead of
   leaking them. Users who relied on monitoring pods surviving the task
   (e.g. for offline log inspection) can keep them by passing
   ``on_finish_action="keep_pod"`` explicitly.
   
   ### How to verify
   
   1. Run a DAG using ``KubernetesJobOperator`` with default settings.
   2. After the task finishes, both the ``V1Job``'s child pod and the
      monitoring pod (label ``airflow_kpo_in_cluster=True``, no
      ``ownerReferences``) should be gone.
   3. Repeat with ``on_finish_action="delete_succeeded_pod"`` and a
      failing command — the monitoring pod should remain for forensics.
   4. Repeat with ``on_finish_action="keep_pod"`` — both pods should
      remain.
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes — GitHub Copilot (Claude Opus 4.7)
   
   Generated-by: GitHub Copilot (Claude Opus 4.7) following [the guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)

