jekeanyanwu opened a new pull request, #66716:
URL: https://github.com/apache/airflow/pull/66716
<!-- closes: #ISSUE -->
related: #66715
## Problem
When `KubernetesPodOperator` runs in deferrable mode and the Kubernetes
garbage collector deletes the pod in the window between the trigger firing a
success/error/timeout event and the worker re-entering the task,
`trigger_reentry` crashes.
The unguarded `self.pod = self.hook.get_pod(pod_name, pod_namespace)` raises
`ApiException(404)` and escapes `trigger_reentry`. On provider versions before
#56976 (which added `if self.pod is None: return` to `_clean`), the `finally`
block additionally crashes with `AttributeError: 'NoneType' object has no
attribute 'metadata'`, masking the original cause.
The existing dead-code branch right below the call:
```python
self.pod = self.hook.get_pod(pod_name, pod_namespace)
if not self.pod:
raise PodNotFoundException("Could not find pod after resuming from
deferral")
```
was clearly intended to handle this, but `hook.get_pod()` raises rather than
returning `None`, so the translation never happens.
Real-world traceback (we hit this routinely on a Kubernetes cluster that
aggressively reclaims completed pods):
```
File ".../operators/pod.py", line 834, in trigger_reentry
self.pod = self.hook.get_pod(pod_name, pod_namespace)
kubernetes.client.exceptions.ApiException: (404) Not Found
{"message":"pods \"load-chiba-lotte-marines-player-tracking-bq-t0x42m45\"
not found", ...}
During handling of the above exception, another exception occurred:
...
File ".../operators/pod.py", line 905, in _clean
self.pod = self.pod_manager.await_pod_completion(...)
File ".../utils/pod_manager.py", line 808, in read_pod
return self._client.read_namespaced_pod(pod.metadata.name,
pod.metadata.namespace)
AttributeError: 'NoneType' object has no attribute 'metadata'
```
The trigger had already emitted `status: success` — the pod ran to
completion successfully, was GC'd, and the worker resumed only to fail the task.
## Solution
Wrap the `get_pod` call so that:
- **Non-404 `ApiException`** re-raises unchanged.
- **404 + `event["status"] == "success"`** logs a warning and returns. The
trigger already observed the pod completed successfully; logs/XCom are
unrecoverable but the task itself succeeded, so retrying is wrong.
- **404 + non-success event** raises `PodNotFoundException`, matching the
existing dead-code intent.
The pre-existing `if not self.pod:` branch is kept as a defensive guard for
any subclass override that returns `None` instead of raising.
## Tests
Three new unit tests in `TestKubernetesPodOperatorAsync` covering the three
branches:
- `test_async_trigger_reentry_returns_when_pod_gcd_on_success`
- `test_async_trigger_reentry_raises_pod_not_found_on_failure`
- `test_async_trigger_reentry_propagates_non_404_api_exception`
```
uv run --project providers/cncf/kubernetes pytest \
providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_pod.py::TestKubernetesPodOperatorAsync
\
-k trigger_reentry -q
```
## Related prior work
- #39296 added `(HTTPError, ApiException)` handling around `_write_logs()` —
runs after the unguarded `get_pod()`, doesn't help.
- #56976 added `if self.pod is None: return` to `_clean` — runs after,
doesn't help.
- This PR closes the remaining gap at the unguarded `get_pod()` call itself.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]