pmcquighan-camus opened a new issue, #56693: URL: https://github.com/apache/airflow/issues/56693
### Apache Airflow Provider(s)

cncf-kubernetes

### Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==10.7.0

### Apache Airflow version

3.0.6

### Operating System

debian 12

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

Running on GKE, Kubernetes version 1.33

### What happened

A Job with parallelism 1 and 1 completion (i.e. just running a single pod to completion) completed successfully. The triggerer detected the Job completion, but before the task resumed, GKE deleted the pod during a node scaling event. Since the pod had already reached `Complete`, the Job is also considered `Complete`, so Kubernetes will not recreate or retry the pod. Then, when the task wakes up, it fails in `resume_execution`, specifically when trying to fetch logs.

The worst part is that on *task retries* the operator sees "job is completed", tries to resume from `execute_complete`, and hits the same pod-not-found error again (instead of perhaps retrying the Job from the start).
```
[2025-10-15, 09:48:22] ERROR - Task failed with exception: source="task"
ApiException: (404)
Reason: Not Found
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 920 in run
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 1215 in _execute_task
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/bases/operator.py", line 1606 in resume_execution
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/operators/job.py", line 276 in execute_complete
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 470 in get_pod
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api/core_v1_api.py", line 23999 in read_namespaced_pod
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api/core_v1_api.py", line 24086 in read_namespaced_pod_with_http_info
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348 in call_api
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180 in __call_api
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 373 in request
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 244 in GET
File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 238 in request
```

I think a primary workaround is to set `get_logs=False`, but I'm not totally certain that this workaround fixes all cases where a PodNotFound might occur.
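For reference, a rough sketch of what the `get_logs=False` workaround might look like on a task. `get_logs`, `deferrable`, and `wait_until_job_complete` are operator parameters; the task id, Job name, namespace, image, and command are placeholders and are not taken from the failing DAG. The arguments are written as a plain dict so the snippet runs without Airflow installed.

```python
# Hypothetical arguments for a KubernetesJobOperator task. Every value except
# get_logs=False is a placeholder chosen for illustration.
job_task_kwargs = {
    "task_id": "run_single_pod_job",
    "name": "example-job",
    "namespace": "default",
    "image": "busybox",
    "cmds": ["sh", "-c", "echo done"],
    "wait_until_job_complete": True,  # wait for the Job instead of fire-and-forget
    "deferrable": True,               # hand the wait over to the triggerer
    "get_logs": False,                # workaround: do not fetch pod logs on resume
}

# In a DAG file this would be passed as:
#   from airflow.providers.cncf.kubernetes.operators.job import KubernetesJobOperator
#   run_job = KubernetesJobOperator(**job_task_kwargs)
```

The trade-off is that pod logs no longer appear in the Airflow task log, so they have to be read from the cluster's own logging.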
Also note that the `None`-check on getting the pod [here](https://github.com/apache/airflow/blob/providers-cncf-kubernetes/10.7.0/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py#L276-L278) is never hit, since `get_pod` ends up throwing a `kubernetes.client.ApiException` instead of returning `None`. I tried patching the code to catch that exception and rethrow the `PodNotFoundException`, but that had no effect.

This feels similar to, but is not fixed by, https://github.com/apache/airflow/issues/39239; notably, a task retry does not result in a successful execution.

### What you think should happen instead

I think failing the task with `PodNotFoundException` when `get_logs=True` is reasonable. However, a task retry should then retry the full task instead of just re-running `execute_complete` and failing on the same exception multiple times. This behavior occurred regardless of whether the Kubernetes Job object still existed.

### How to reproduce

Run a `KubernetesJobOperator` that does anything, and once the pod completes (but before Airflow fetches logs and marks the task complete), manually delete the pod. In an actual cloud-hosted Kubernetes environment, a cluster-autoscaling component might delete the pod, but that is hard to rely on, so a manual delete mimics the same behavior.

### Anything else

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
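To illustrate why the `None`-check is bypassed, here is a minimal self-contained sketch. It does not use the provider's code: `ApiException` and `PodNotFoundException` below are stand-in classes, and `get_pod_or_none`, `fetch_pod_for_logs`, and `deleted_pod` are hypothetical helpers invented for this example. It only shows one way a 404 could be translated so that the existing check becomes reachable; it does not address the retry behavior described above.

```python
from typing import Callable, Optional


class ApiException(Exception):
    """Stand-in for kubernetes.client.ApiException, which carries a .status code."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"({status}) Reason: {reason}")
        self.status = status
        self.reason = reason


class PodNotFoundException(Exception):
    """Stand-in for the provider's PodNotFoundException."""


def get_pod_or_none(
    get_pod: Callable[[str, str], dict], name: str, namespace: str
) -> Optional[dict]:
    """Return the pod, or None when the API server answers 404 (hypothetical helper)."""
    try:
        return get_pod(name, namespace)
    except ApiException as exc:
        if exc.status == 404:
            return None
        raise  # other API errors are not "pod missing" and should surface unchanged


def fetch_pod_for_logs(
    get_pod: Callable[[str, str], dict], name: str, namespace: str
) -> dict:
    """Mimic the operator's None-check, now reachable because 404 maps to None."""
    pod = get_pod_or_none(get_pod, name, namespace)
    if pod is None:
        raise PodNotFoundException(f"Could not find pod {name} in namespace {namespace}")
    return pod


def deleted_pod(name: str, namespace: str) -> dict:
    """Simulate a lookup for a pod that was removed, e.g. by a node scale-down."""
    raise ApiException(404, "Not Found")
```

Calling `fetch_pod_for_logs(deleted_pod, "my-pod", "default")` raises `PodNotFoundException` rather than the raw 404, which gives calling code a specific exception type to react to.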
